Аналіз великих даних за допомогою методів редукції моделей

The enormous growth in the size of data has been observed in recent years being a key factor of the Big Data scenario. Big Data require a new high-performance processing. The use of big data preprocessing methods for data mining in big data is reviewed in this paper. The definition, attributes and c...

Full description

Saved in:
Bibliographic Details
Date:2018
Main Author: Zabielin, Stanislav I.
Format: Article
Language:English
Published: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2018
Subjects:
Online Access:https://journal.iasa.kpi.ua/article/view/109517
Tags: Add Tag
No Tags, Be the first to tag this record!
Journal Title:System research and information technologies
Download file: Pdf

Institution

System research and information technologies
_version_ 1867334318691450880
author Zabielin, Stanislav I.
author_facet Zabielin, Stanislav I.
author_institution_txt_mv [ { "author": "Stanislav I. Zabielin", "institution": "Educational and Scientific Complex \"Institute for Applied System Analysis\" of the National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv" } ]
author_sort Zabielin, Stanislav I.
baseUrl_str http://journal.iasa.kpi.ua/oai
collection OJS
datestamp_date 2019-01-17T13:29:35Z
description The enormous growth in the size of data has been observed in recent years being a key factor of the Big Data scenario. Big Data require a new high-performance processing. The use of big data preprocessing methods for data mining in big data is reviewed in this paper. The definition, attributes and categorization of data preprocessing approaches in big data are introduced. The relation between big data and data preprocessing throughout all families of methods and advanced data technologies are likewise analyzed. Furthermore, research challenges are discussed, while concentrating on improvements in certain families of data preprocessing methods and applications based on new big data learning paradigms.
doi_str_mv 10.20535/SRIT.2308-8893.2018.2.04
first_indexed 2025-07-17T10:23:02Z
format Article
fulltext  Stanislav Zabielin, 2018 Системні дослідження та інформаційні технології, 2018, № 2 35 УДК 004.8 DOI: 10.20535/SRIT.2308-8893.2018.2.04 BIG DATA ANALYSIS VIA MODEL REDUCTION METHODS STANISLAV ZABIELIN Abstract. The enormous growth in the size of data has been observed in re- cent years being a key factor of the Big Data scenario. Big Data require a new high-performance processing. The use of big data preprocessing meth- ods for data mining in big data is reviewed in this paper. The definition, at- tributes and categorization of data preprocessing approaches in big data are introduced. The relation between big data and data preprocessing through- out all families of methods and advanced data technologies are likewise analyzed. Furthermore, research challenges are discussed, while concentrat- ing on improvements in certain families of data preprocessing methods and applications based on new big data learning paradigms. Keywords: nonlinear mapping, dimension reduction, big data, modelling, non-linear dynamic objects, dimensional reduction, diffusion maps, kernel method of main components INTRODUCTION Vast amounts of raw data are encompassing us in our world, information that can’t be directly treated by humans or manual applications. Technologies as the Internet, engineering and science applications and networks, business services and much more create data in exponential development because of the development of powerful storage and connection tools. Organized knowledge and information can’t be easily obtained because of this enormous data growth and neither one of its can be effectively understood or automatically extracted. These premises have prompted to the development of data science or data mining, a well-known disci- pline which is more and more present in the current world of the Information Age [1]. Nowadays, the current amount of data managed by systems all around the globe have surpassed the processing capacity of more traditional systems, and this applies to data mining as well. The arising of new technologies and services (like Cloud computing [9]) as well as the reduction in hardware price are leading to an ever-growing rate of information on the Internet. This phenomenon certainly represents a “Big” challenge for the data analytics community. Big Data can be thus defined as very high volume, velocity and variety of data that require a new high-performance processing [10]. Distributed computing has been widely used by data scientists before the ad- vent of Big Data phenomenon. Many standard and time-consuming algorithms were replaced by their distributed versions with the aim of agilizing the learning process. However, for most of the current massive problems, a distributed ap- proach becomes mandatory nowadays since no centralized architecture that is able to tackle these huge problems [12]. Stanislav Zabielin ISSN 1681–6048 System Research & Information Technologies, 2018, № 2 36 Nonlinear dimensionality reduction (NLDR) is an attractive topic in many scientific fields. The task of NLDR is to recover the latent low-dimensional struc- tures hidden in high dimensional data. In many areas of artificial intelligence and data mining, the encountered high-dimensional data are intrinsically distributed on a smooth, low-dimensional manifold [8]. The NLDR problem on such data is specifically called “manifold learning” problem [13]. In recent years, there have emerged many manifold learning ap- proaches which are applied to many real-world application problems (e.g., hyper- spectral imaging classification and object tracking), aiming at discovering the in- trinsic geometric representations of the nonlinear data manifolds. Based on the intrinsic construction principles, these approaches can be divided into two catego- ries: global and local approaches. Global approaches, such as Isomap and CDA, attempt to preserve geometry at both local and global scales, essentially construct- ing entire isometric corresponding between all datapairs in the original and latent spaces [14]. Local approaches, such as LLE and Laplacian eigenmaps, attempt to preserve the local geometry of the data, intrinsically keeping invariance between all local areas in the original and latent spaces [13]. PROBLEM Suppose the data consist of N real-valued vectors iX , each of dimensionality D , sampled from some underlying manifold. Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold. We charac- terize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function: 2 )(   i j jiji XWXW . Which adds up the squared distances between all the data points and their re- constructions. The weights ijW summarize the contribution of the jth data point to the ith reconstruction. To compute the weights ijW , we minimize the cost func- tion subject to two constraints: first, that each data point iX is reconstructed only from its neighbors, enforcing 0ijW , if jX does not belong to the set of neighbors of iX ;second, that the rows of the weight matrix sum to one:   j ijW 1. Suppose the data lie on or near a smooth nonlinear manifold of lower dimen- sionality Dd  .To a good approximation then, there exists a linear mapping- consisting of a translation, rotation, and rescaling-that maps the high-dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights ijW reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights ijW that Big Data analysis via model reduction methods Системні дослідження та інформаційні технології, 2018, № 2 37 reconstruct the ith data point in D dimensions should also reconstruct its embed- ded manifold coordinates in d dimensions. DISTANCE PRESERVATION This part discusses methods that reduce the dimensionality of data using the dis- tance preservation criterion. Ideally, saving the pairwise distances measured in the dataset ensures that the low-dimensional nesting inherits the basic geometric properties of the data, such as the global shape or local neighborhood relations. Unfortunately, in the nonlinear case the distances cannot be completely preserved. We will discuss various methods that try to overcome this difficulty. These meth- ods use different types of distances; they also rely on various algorithms or opti- mization procedures to determine the nesting [15]. Multidimensional scaling The term multidimensional scaling (MDS) actually hides a family of methods, rather than one clearly defined procedure. Scaling refers to methods that build a configuration of points in the target metric space from information about interme- diate distances, and MDS scales when the target space is Euclidean [7]. The optimization method is exact and purely algebraic: the optimal solution is obtained in a closed form. Metric MDS is also called the spectral method, since the kernel operation in its procedure is the EVD of the Gram matrix. The model is continuous and strictly linear. The mapping is implicit. MDS is simple, durable, but strictly linear. But the metric MDS is flexible: it takes coordinates, as well as scalar products or Euclidean distances. On the other hand, the MDS metric requires memory to store the N-by-N Gram matrix. An- other limitation is the generalization to new data points, which includes an ap- proximate formula for the double centering phase [16]. Sammon’s nonlinear mapping In 1969, Sammon proposed a method for establishing a mapping between a high- order space and a lower one. The model is nonlinear and discrete, creating an ex- plicit mapping [6]. NLM Sammon uses an approximate optimization procedure that can get stuck at a local minimum. The method does not include an estimate of its own dimension; the dimension size is actually fixed by the user. Incremental or layered attachments are not possible: the method must be run separately for each given dimension [11]. Compared to the classical MDS metric, NLM can efficiently handle nonlin- ear varieties, at least if they are not too complex. Among other non-linear versions of the metric MDS, the NLM remains relatively simple and elegant. As a major drawback, NLM does not have the ability to generalize mapping to new points. Just like many other methods of preserving distance, NLM in its original version works with a full distance matrix, therefore, contains )( 2NO records. This can be an obstacle when embedding very large data sets. Another drawback of NLM is its optimization procedure, which can be slow and/or inefficient for Stanislav Zabielin ISSN 1681–6048 System Research & Information Technologies, 2018, № 2 38 some data sets. In particular, the function of the stress of Sammon is not guaran- teed concave; therefore, the optimization process may depend on the local mini- mum. Graph distances. In short, these methods try to overcome some of the short- comings of spatial metrics, such as Euclidean distance. The next subsection pre- sents both geodetic and graphical distances, explains how they relate to one an- other, and motivates their use in the context of diminishing dimensionality. The following subsections describe methods for reducing the non-linear dimension, which use graph distances. Isomap Isomap is the simplest method of NLDR, which uses the distance of the graph as an approximation of the geodetic distance. Isomap is an autonomous batch method that works with precise algebraic optimization. Since Isomap operates as a metric MDS, decomposing the Gram matrix into eigenvalues and eigenvectors is often qualified as a spectral method. In the literature, Isomap is described with- out preliminary processing of data, such as vector quantization [5]. The isomap relies on a nonlinear model. In fact, if the Euclidean distance can be considered as a "linear" metric, then the Isomap's ability to embed nonlinear varieties arises only because of the use of the distance of the graph [17]; Other parts of the method, such as the basic model of the optimization procedure, are based on the classical metric MDS and remain purely linear. Therefore, the Isomap data model is hybrid: the geodetic distance approximation along the face distances is discrete, whereas the subsequent MDS-like step can be considered continuous [18]. Isomap extends metric MDS in a very elegant way. However, the data model of Isomap, which relies on developable manifolds, still remains too rigid. Indeed, when the manifold to be embedded is not developable, Isomap yields disappoint- ing results. In this case, the guarantee of determining a global error does not really matter. Another problem encountered when running Isomap is the practical compu- tation of the geodesic distances. The approximations given by the graph distances may be very rough, and their quality depends on both the data (number of points, noise) and the method parameters. TOPOLOGY PRESERVATION This part discusses methods that reduce dimensionality while preserving the data topology, rather than their pairwise distances. Preservation of topology seems more powerful, but more difficult to implement than keeping distance. The meth- ods described are divided into two classes, depending on the type of topology used. The simplest methods are based on a predefined topology, whereas more modern methods prefer a topology built in accordance with a set of data that will be re-embedded. Self-Organizing Maps Along with multilayer perceptron (MLP), the self-organizing map is perhaps the most widely known method in the field of artificial neural networks. The SOM is Big Data analysis via model reduction methods Системні дослідження та інформаційні технології, 2018, № 2 39 basically a vector quantization method. This means that vector quantization is mandatory in the SOM. As for the reduction in dimension, SOM models the data in a nonlinear and discrete way, representing it with a deformed lattice [4]. SOM is a neural network with learning without a teacher, performing the tasks of visualization and clustering. The idea of the network was proposed by the Finnish scientist T. Kohonen. Most of the time, SOMs are implemented by standalone algorithms, similar to the Robbins-Monro procedure. Online versions can be easily obtained. There is a so-called "batch" version of SOM: instead of updating the prototypes, one after another, they all move simultaneously at the end of each epoch, as in the standard gradient descents [19]. The wide success of the SOM can be explained by the following advantages. The method is very simple from an algorithmic point of view, and its main idea, once understood, is intuitively attractive. Zombies are reliable enough and work very well in many situations, such as visualization of tagged data. Nevertheless, the SOM have some known shortcomings, especially when they are used to reduce the dimension. Most implementations process only one or two-dimensional lattices. Vector quantization is mandatory, which means that the SOM does not really embed data points: base coordinates are calculated only for prototypes. Moreover, the form of the embedding is identical to the lattice, which, in turn, is determined in advance, arbitrarily. This means that the SOM cannot capture the shape of the data cloud in a low-dimensional attachment. From a computational point of view, it is very difficult to assess the convergence of the SOM, since an explicit objective function or an error criterion has been optimized [20]. In fact, it is proved that such a criterion cannot be determined, with the ex- ception of some very special cases. Generative Topographic Mapping The generic topographic mapping (GTM) was put forward by Bishop, Swensen and Williams as a fundamental alternative to the SOM. In fact, GTM is a specific density network based on generative modeling, as indicated by its name. [3] The essential difference between GTM and almost all the other methods is that the GTM is based on the Bayesian learning principle [21]. This probabilistic approach leads to another optimization method: instead of using (stochastic) gra- dient descent or spectral decomposition, the EM algorithm is used. As described above, GTM is a batch method, but there is also a version that works with the sto- chastic EM procedure. Since GTM defines the parameters of the generating model of the data, the diminution of the dimension is easily generalized to new points. Therefore, we can say that GTM defines an implicit mapping, although the hidden space is discrete. If the implementation does not impose a two-dimensional grid, an external procedure is needed to evaluate the internal dimension of the data to determine the correct measurement for the hidden space. Compared to SOM, GTM provides a generative data model. Moreover, the probabilistic approach that has come has a number of advantages. First, apart from finding the hidden coordinates x of the point y, GTM can also approximate p Stanislav Zabielin ISSN 1681–6048 System Research & Information Technologies, 2018, № 2 40 (x | y), that is, the probability that the attachment will be located in x coordinates in a hidden space. This allows us to identify problems in diminishing dimension- ality when, for example, the probability distribution is not unimodal [3]. Secondly, from an algorithmic point of view, GTM optimizes a clearly de- fined objective function. GTM optimizes the probability of using the EM algo- rithm. In comparison with these classical optimization methods, EM is guaranteed to maximize the probability monotonically and converge to a maximum after sev- eral dozen iterations [22]. CONCLUSIONS Dimension reduction often plays an important role in the analysis, interpretation and understanding of numerical data. In practice, reducing the dimension can help to extract some information from arrays of numbers that would otherwise remain useless because of their large size. To a certain extent, the goal is to improve the readability of the data. This can be achieved by visualizing the data in diagrams, diagrams, graphs and other graphical representations. An important motivation (NL) DR is the prevention of its harmful conse- quences. Paradoxically, however, many NLDR methods do not completely solve the problem, but only shy away from it. Many NLDR methods give poor results when the internal dimension of the underlying variety exceeds four or five. In such cases, the size of the embedding space becomes high enough to observe un- desirable effects associated with the curse of dimension, for example, the phe- nomenon of empty space. The future will show whether new methods can solve this problem. REFERENCES 1. Big Data prediction for 2013. Blog by Mike Gualtieri. (n.d.) — Available at: http://blogs.forrester.com/mike_gualtieri 2. Big Data prediction for 2013. Blog by Mike Gualtieri. (n.d.) — Available at: http://blogs.forrester.com/mike_gualtieri 3. Horvath D. Generative Topographic Mapping of Conformational Space / D. Horvath, I. Baskin, G. Marcou, A. Varnek // Molecular Informatics. — 2017. — 36 (10). — P. 22. 4. Kohonen T. Essentials of the self-organizing map / T. Kohonen // Neural Networks. — 2013. — N 37. — P. 52–65. 5. Wang L. The Isomap Algorithm and Topological Stability / L. Wang // Science. — 2002. — 295 (5552). — P. 81. 6. Lerner B. On pattern classification with Sammons nonlinear mapping an experimen- tal study / B. Lerner // Pattern Recognition. — 1998. — 31(4). — P. 371–381. 7. Young F. Multidimensional Scaling: History, Theory, and Applications / B. Lerner // Psychology Press. — 2017. — N 11. — P. 13. 8. Lee J. Nonlinear dimensionality reduction / J. Lee, M. Verleysen // NY: Springer. — 2010. — 29. — P. 110. 9. Marinescu D. Cloud Computing: Theory and Practice / D. Marinescu // Elsevier Sci- ence & Technology Books. — 2017. — 2. — P. 66. Big Data analysis via model reduction methods Системні дослідження та інформаційні технології, 2018, № 2 41 10. Beyond the hype: Big data concepts, methods, and analytics. Egyptian Journal of Medical Human Genetics. Available at: https://www.sciencedirect.com/science/ article/pii/ S0268401214001066 11. Ewing R. Visualization of expression clusters using Sammons non-linear mapping / R. Ewing R., J. Cherry // Bioinformatics. — 2001. — 17(7). — P. 658–659. 12. Dinh H. A survey of mobile cloud computing: architecture, applications, and ap- proaches / H. Dinh // Wireless Communications and Mobile Computing. — 2011. — 13(18). — P. 1587–1611. 13. Wang Q. Combining local and global information for nonlinear dimensionality reduction / Q. Wang, J. Li // Neurocomputing. — 2009. — 72(10–12). — P. 2235–2241. 14. You S. Think locally, fit globally: Robust and fast 3D shape matching via adaptive algebraic fitting / S. You, D. Zhang // Neurocomputing. — 2017. — N 89. — P. 119–129. 15. Lee J. Nonlinear projection with curvilinear distances: Isomap versus curvilinear distance analysis / J. Lee J., A. Lendasse, M. Verleysen // Neurocomputing. — 2004. — N 57. — P. 49–76. 16. Cox T. Multidimensional scaling / T. Cox, M. Cox // Boca Raton. — 2001. — 11. — P. 22. 17. Law M. Incremental nonlinear dimensionality reduction by manifold learning / M. Law, A. Jain // IEEE Transactions on Pattern Analysis and Machine Intelli- gence. — 2006. — 28(3). — P. 377–391. 18. Lee T. Improved criteria for sampled-data synchronization of chaotic Lur’e systems using two new approaches / T. Lee, J. Park // Nonlinear Analysis: Hybrid Sys- tems. — 2017. — 24. — P. 132–145. 19. Du K. Clustering: A neural network approach / K. Du // Neural Networks. — 2010. — 23(1). — P. 89–107. 20. Wang L. Local Dynamic Modeling with SelfOrganizing Maps and Applications to Nonlinear System Identification and Control / L. Wang // Intelligent Signal Proc- essing. — 2009. — 15. — P. 21. 21. Svensen J. GTM: the Generative Topographic Mapping / J. Svensen // University of Aston in Birmingham. — 1998. — 12. — P. 981. 22. Ghahramani Z. Unsupervised Learning / Z. Ghahramani // Advanced Lectures on Machine Learning Lecture Notes in Computer Science. — 2004. — 15. — P. 72–112. Recieved 08.11.2017 From the Editorial Board: the article corresponds completely to submitted manuscript.
id journaliasakpiua-article-109517
institution System research and information technologies
keywords_txt_mv keywords
language English
last_indexed 2025-07-17T10:23:02Z
publishDate 2018
publisher The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
record_format ojs
resource_txt_mv journaliasakpiua/3a/989605caff7734611bfb6fa9a251eb3a.pdf
spelling journaliasakpiua-article-1095172019-01-17T13:29:35Z Big Data analysis via model reduction methods Анализ больших данных с помощью методов редукции моделей Аналіз великих даних за допомогою методів редукції моделей Zabielin, Stanislav I. nonlinear mapping dimension reduction big data modelling non-linear dynamic objects diffusion maps kernel method of main components нелинейное отображение уменьшение размерности большие данные моделирование нелинейные динамические объекты диффузионные карты ядерный метод основных компонент нелінійне відображення зменшення розмірності великі дані моделювання нелінійні динамічні об'єкти дифузійні карти ядерний метод основних компонент The enormous growth in the size of data has been observed in recent years being a key factor of the Big Data scenario. Big Data require a new high-performance processing. The use of big data preprocessing methods for data mining in big data is reviewed in this paper. The definition, attributes and categorization of data preprocessing approaches in big data are introduced. The relation between big data and data preprocessing throughout all families of methods and advanced data technologies are likewise analyzed. Furthermore, research challenges are discussed, while concentrating on improvements in certain families of data preprocessing methods and applications based on new big data learning paradigms. Стремительное увеличение объема даннях свидетельствует в пользу больших данных, требующих новой высокопроизводительной обработки. Рассмотрено использование методов предварительной обработки больших даннях, введены определения, атрибуты и категоризация подпрограмм предварительной обработки данных в больших данных. Проанализирована взаимосвязь между большими данными и предварительной обработкой данных во всех существующих методах и больших технологиях данных. Раскрыты проблемы исследований, в которых основное внимание уделяется усовершенствованиям в некоторых группах методов и приложений предварительной обработки данных, основанных на новых парадигмах обучения больших данных. Стрімке збільшення обсягу даних свідчить на користь великих даних, які потребують нового високопродуктивного оброблення. Розглянуто використання методів попереднього оброблення великих даних, уведено визначення, атрибути і категоризацію підпрограм попереднього оброблення даних у великих даних. Проаналізовано взаємозв'язок між великими даними і попереднім обробленням даних у всіх наявних методах і високих технологіях даних. Розкрито проблеми досліджень, основну увагу в яких приділено удосконаленню в деяких групах методів і додатків попереднього оброблення даних, заснованих на нових парадигмах навчання великих даних. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2018-06-20 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/109517 10.20535/SRIT.2308-8893.2018.2.04 System research and information technologies; No. 2 (2018); 35-41 Системные исследования и информационные технологии; № 2 (2018); 35-41 Системні дослідження та інформаційні технології; № 2 (2018); 35-41 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/109517/136748 Copyright (c) 2021 System research and information technologies
spellingShingle нелінійне відображення
зменшення розмірності
великі дані
моделювання
нелінійні динамічні об'єкти
дифузійні карти
ядерний метод основних компонент
Zabielin, Stanislav I.
Аналіз великих даних за допомогою методів редукції моделей
title Аналіз великих даних за допомогою методів редукції моделей
title_alt Big Data analysis via model reduction methods
Анализ больших данных с помощью методов редукции моделей
title_full Аналіз великих даних за допомогою методів редукції моделей
title_fullStr Аналіз великих даних за допомогою методів редукції моделей
title_full_unstemmed Аналіз великих даних за допомогою методів редукції моделей
title_short Аналіз великих даних за допомогою методів редукції моделей
title_sort аналіз великих даних за допомогою методів редукції моделей
topic нелінійне відображення
зменшення розмірності
великі дані
моделювання
нелінійні динамічні об'єкти
дифузійні карти
ядерний метод основних компонент
topic_facet nonlinear mapping
dimension reduction
big data
modelling
non-linear dynamic objects
diffusion maps
kernel method of main components
нелинейное отображение
уменьшение размерности
большие данные
моделирование
нелинейные динамические объекты
диффузионные карты
ядерный метод основных компонент
нелінійне відображення
зменшення розмірності
великі дані
моделювання
нелінійні динамічні об'єкти
дифузійні карти
ядерний метод основних компонент
url https://journal.iasa.kpi.ua/article/view/109517
work_keys_str_mv AT zabielinstanislavi bigdataanalysisviamodelreductionmethods
AT zabielinstanislavi analizbolʹšihdannyhspomoŝʹûmetodovredukciimodelej
AT zabielinstanislavi analízvelikihdanihzadopomogoûmetodívredukcíímodelej