ARTICLE | doi:10.20944/preprints201706.0115.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: big data; Hadoop; visualization; model
Online: 26 June 2017 (06:07:51 CEST)
In the era of ever-expanding data and knowledge, we lack a centralized system that maps faculty members to their research works. This problem has not been addressed in the past, and it is challenging for students to connect with the right faculty in their domain. Given the large number of colleges and faculty members, this falls into the category of big data problems. In this paper, we present a model that runs in a distributed computing environment to tackle big data. The proposed model uses Apache Spark as the execution engine and Apache Hive as the database. The results are visualized with Tableau, which connects to Apache Hive to achieve distributed computing.
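The core aggregation such a model computes, grouping each faculty member's publications under their name, can be sketched on a single machine as follows. The record layout, names, and titles are illustrative assumptions (the paper's actual Hive schema is not given); in the proposed model this grouping would run distributed on Spark over Hive tables:

```python
from collections import defaultdict

# Illustrative records: (faculty_name, domain, publication_title).
# This schema is an assumption for the sketch, not the paper's schema.
records = [
    ("Dr. Rao",   "big data",         "Scaling Hive Queries"),
    ("Dr. Rao",   "big data",         "Spark Join Strategies"),
    ("Dr. Mehta", "machine learning", "CNNs for Text"),
]

def map_to_faculty(record):
    """Map phase: emit a (faculty, publication) pair per record."""
    faculty, _domain, title = record
    return (faculty, title)

def reduce_by_faculty(pairs):
    """Reduce phase: collect publications under each faculty."""
    index = defaultdict(list)
    for faculty, title in pairs:
        index[faculty].append(title)
    return dict(index)

faculty_index = reduce_by_faculty(map(map_to_faculty, records))
print(faculty_index["Dr. Rao"])
```

The map/reduce split mirrors how a Spark job would express the same grouping (`map` followed by `groupByKey` or `reduceByKey`) across many executors.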
ARTICLE | doi:10.20944/preprints201904.0281.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: cluster computing; big data; Spark; Hadoop
Online: 25 April 2019 (11:22:27 CEST)
The article provides detailed information about the cluster-computing technologies Hadoop and Apache Spark. An experimental task, processing logistic regression with these technologies, is considered. Findings comparing the cluster-computing performance of Hadoop and Apache Spark are presented and substantiated.
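The benchmarked workload, logistic regression trained by iterative gradient descent, is the canonical Spark-versus-Hadoop comparison task because every iteration re-reads the full dataset. A minimal single-machine sketch, with a toy one-dimensional dataset and learning rate that are assumptions of this illustration:

```python
import math

# Toy 1-D dataset: points below 0.5 are class 0, above are class 1.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0
lr = 1.0
for _ in range(500):                    # each iteration scans the full
    gw = gb = 0.0                       # dataset; Spark caches it in memory,
    for x, y in data:                   # while Hadoop MapReduce re-reads it
        err = sigmoid(w * x + b) - y    # from disk every iteration
        gw += err * x
        gb += err
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

def predict(x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print(predict(0.2), predict(0.8))  # → 0 1
```

The comment in the loop states the usual explanation for Spark's advantage on this task: iterative algorithms reuse the same input, so in-memory caching removes the per-iteration disk I/O that dominates a MapReduce implementation.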
ARTICLE | doi:10.20944/preprints201810.0618.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: classification; machine learning; chaos-based cryptography; Hadoop; data clustering; biometrics
Online: 26 October 2018 (05:50:53 CEST)
Authentication systems based on biometric characteristics and data represent one of the most important trends in the evolution of our world. In the near future, biometric systems will be everywhere in society: government, education, smart cities, banks, etc. Due to the uniqueness of biometric data, such systems will also be vulnerable, with privacy being one of the most important challenges. Classic cryptographic primitives are not sufficient to ensure a strong level of privacy protection. This paper presents the main cryptographic techniques and algorithms that can raise the level of privacy protection, and discusses their strengths and weaknesses. We demonstrate how the most common and well-known techniques and algorithms can be used to achieve maximum efficiency and a high level of integrity for biometric data.
REVIEW | doi:10.20944/preprints202211.0161.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: High Performance Computing (HPC); big data; High Performance Data Analytics (HPDA); convergence; data locality; Spark; Hadoop; design patterns; process mapping; in-situ data analysis
Online: 9 November 2022 (01:38:34 CET)
Big data has revolutionised science and technology, leading to the transformation of our societies. High Performance Computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence methods. Traditionally, HPC and big data have focused on different problem domains and have grown into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into HPC and big data converged architectures. Designing HPC and big data converged systems is a hard task, requiring careful placement of data, analytics, and other computational tasks so that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of High Performance Data Analytics (HPDA) system design, as moving even a byte becomes increasingly costly in both time and energy as system size grows. Performance in terms of time and energy is the most important concern for users, particularly energy, since it is the major hurdle in high-performance system design and green energy systems are an increasing focus due to environmental sustainability. Data locality is a broad term that encapsulates different aspects, including bringing computations to data, minimising data movement through efficient exploitation of cache hierarchies, reducing intra- and inter-node communication, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the state of the art on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions.
Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
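One of the data-locality aspects the review lists, exploiting cache hierarchies by matching traversal order to memory layout, can be illustrated with a toy example. The matrix size is an arbitrary assumption, and in pure Python the timing effect is muted compared with compiled code, but the difference in access pattern is the same one that matters at scale:

```python
N = 300
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    """Traverse each row contiguously: consecutive accesses touch
    neighbouring elements, the cache-friendly order."""
    total = 0
    for row in m:
        for value in row:
            total += value
    return total

def sum_column_major(m):
    """Jump to a different row on every access: the same arithmetic,
    but with poor spatial locality."""
    total = 0
    for j in range(N):
        for i in range(N):
            total += m[i][j]
    return total

# Both orders compute the same result; only the memory-access
# pattern (and hence cache behaviour in compiled code) differs.
print(sum_row_major(matrix) == sum_column_major(matrix))  # → True
```

This is the simplest instance of "bringing computations to data": the work is identical, and performance differences come entirely from where the next byte to be read is sitting.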