A robust clustering technique for a Big Data approach: CLARABD for Mixed data types
DOI:
https://doi.org/10.47187/perf.v2i22.68
Keywords:
Classification, CLARA, K-medoids, mixed data types, R software
Abstract
When a researcher has no a priori knowledge of how the groups in a given data set are configured, an unsupervised classification is needed. In addition, the data set may be mixed (qualitative and/or quantitative variables) or very large. The k-means algorithm, for example, does not allow the comparison of mixed data and is limited to a maximum of 65536 objects in the R software. K-medoids allows the comparison of mixed data but is subject to the same limit on the number of objects. The traditional CLARA algorithm can easily exceed this limit, but it does not allow the comparison of mixed data. In this context, this work proposes CLARABD, an extension of the CLARA algorithm to mixed data. The Gower distance is central to this extension: it allows mixed data to be compared, while the CLARA framework makes it possible to process data sets with more than 65536 observations. To show the benefits of the proposed algorithm, a simulation study and an application to real data were carried out, with consistent results in each case.
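The idea summarized in the abstract (CLARA-style subsampling combined with the Gower distance) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in R and is not shown on this page. It is a Python illustration under stated assumptions: all function names are hypothetical, every variable is weighted equally, numeric ranges are taken from the full data set, and a simple greedy swap search stands in for the PAM step.

```python
import random

def gower(a, b, is_numeric, ranges):
    """Gower dissimilarity between two records with mixed variable types."""
    total = 0.0
    for x, y, num, rng in zip(a, b, is_numeric, ranges):
        if num:
            # Numeric variable: absolute difference scaled by the variable's range.
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Categorical variable: simple mismatch (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
    return total / len(a)  # assumption: equal weights for all variables

def total_cost(data, medoids, dist):
    """Sum of distances from every record to its nearest medoid."""
    return sum(min(dist(row, m) for m in medoids) for row in data)

def k_medoids(sample, k, dist, rng):
    """Greedy swap search on a small sample (a simplified stand-in for PAM)."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(sample, trial, dist)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara_mixed(data, k, is_numeric, n_samples=5, sample_size=40, seed=0):
    """CLARA-style loop: cluster several subsamples, keep the best medoids."""
    rng = random.Random(seed)
    ranges = [
        (max(r[j] for r in data) - min(r[j] for r in data)) if num else 0.0
        for j, num in enumerate(is_numeric)
    ]

    def dist(a, b):
        return gower(a, b, is_numeric, ranges)

    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        medoids = k_medoids(sample, k, dist, rng)
        cost = total_cost(data, medoids, dist)  # evaluated on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = [min(range(k), key=lambda c: dist(row, best_medoids[c])) for row in data]
    return best_medoids, labels, best_cost

# Toy mixed data: one numeric and one categorical variable, two separated groups.
gen = random.Random(1)
data = [(gen.gauss(0, 1), "a") for _ in range(100)]
data += [(gen.gauss(10, 1), "b") for _ in range(100)]
medoids, labels, cost = clara_mixed(data, k=2, is_numeric=[True, False])
print(medoids, cost)
```

Because the medoid search runs only on subsamples, no full pairwise distance matrix over all observations is ever built; each candidate set of medoids is scored with one pass over the data. That is the property that lets a CLARA-type method scale past the object limit mentioned above.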
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.