ARTICLE | doi:10.20944/preprints201904.0281.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Cluster computing, Big Data, Spark, Hadoop.
Online: 25 April 2019 (11:22:27 CEST)
The article provides detailed information about two cluster-computing technologies, Hadoop and Apache Spark. An experimental logistic-regression workload is processed with each technology, and the findings of the performance comparison between the Hadoop and Apache Spark clusters are presented and substantiated.
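Logistic regression is a natural benchmark for such a comparison because it is iterative: every pass re-reads the same training set, which Spark can keep cached in memory while Hadoop MapReduce reloads it from disk. A minimal pure-Python sketch of the iterated gradient update (illustrative only; the toy dataset, learning rate, and iteration count are hypothetical, and the paper's actual cluster code is not shown):

```python
import math

# Toy dataset of (features, label) pairs; in Spark this would be a cached RDD,
# whereas Hadoop MapReduce re-reads the input from disk on every iteration.
data = [((1.0, 2.0), 1), ((2.0, 0.5), 1), ((-1.0, -1.5), 0), ((-2.0, -0.5), 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, iterations=100, lr=0.5):
    w = [0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0, 0.0]
        for x, y in data:  # one full scan of the dataset per iteration
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(2):
                grad[j] += (p - y) * x[j]  # gradient of the log-loss
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

w = train(data)
print(w)
```

On a cluster only the per-pass gradient sums would be computed in parallel; the repeated full scans are exactly where in-memory caching pays off for iterative workloads.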
ARTICLE | doi:10.20944/preprints202205.0334.v1
Online: 24 May 2022 (11:47:39 CEST)
In the past decades, a significant rise in the adoption of streaming applications has changed the decision-making process for industry and academia. This movement has led to the emergence of numerous Big Data technologies such as Apache Storm, Spark, Heron, Samza, and Flink, which provide in-memory processing for real-time Big Data analysis at high throughput. Spark Streaming is one of the most popular open-source implementations; it handles ever-increasing data ingestion and processing by using the Unified Memory Manager to dynamically manage memory occupancy between the storage and processing regions, which is the focus of this study. The problem behind memory management for data-intensive stream processing pipelines is that incoming data arrives faster than the downstream operators can consume it. Consequently, Spark's backpressure acts in the direction opposite to the downstream operators. In such a case, the incoming data overwhelms the memory manager and provokes memory-leak issues, degrading application performance through, e.g., high latency, low throughput, or even data loss. The initial intuition motivating our work is therefore that memory management has become the critical factor for keeping Spark processing at scale and keeping the system stable. This work provides a deep dive into Spark backpressure, evaluates its structure, presents the main characteristics needed to support data-intensive streaming pipelines, and investigates the current in-memory performance issues.
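For orientation, the Unified Memory Manager splits the heap roughly as follows; this is a simplified sketch using Spark's documented defaults (`spark.memory.fraction = 0.6`, `spark.memory.storageFraction = 0.5`, and the fixed 300 MB reserve), while the exact accounting inside Spark is more involved:

```python
RESERVED_MB = 300  # Spark's fixed reserved memory

def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate the Unified Memory Manager's region sizes (in MB).

    Storage and execution share a single pool; storage_fraction only sets a
    soft boundary below which cached blocks are protected from eviction.
    Either region may borrow free space from the other at runtime.
    """
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction          # shared storage + execution pool
    storage_soft = unified * storage_fraction   # eviction-protected storage part
    user = usable - unified                     # user data structures, UDF objects
    return {"unified": unified, "storage_soft": storage_soft, "user": user}

regions = unified_memory_regions(4096)
print(regions)  # with a 4 GB heap: unified pool is approximately 2277.6 MB
```

Under backpressure, it is this shared pool that fills up: execution borrows from storage, cached blocks get evicted, and once neither region can reclaim space the symptoms described above (latency spikes, throughput collapse, data loss) appear.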
ARTICLE | doi:10.20944/preprints201804.0231.v1
Subject: Earth Sciences, Atmospheric Science Keywords: lightning; spark discharges; tortuosity; streamer bursts; leaders
Online: 18 April 2018 (06:21:29 CEST)
The physical reason for the small-scale tortuosity observed in spark and lightning channels is unknown at present. In this paper it is suggested that the small-scale tortuosity of discharge channels is caused by the natural tendency of subsequent leader streamer bursts to avoid each other while, at the same time, aligning as much as possible along the direction of the background electric field. This process gives rise to a discharge channel that re-orients in space during each streamer burst, creating the small-scale tortuosity.
ARTICLE | doi:10.20944/preprints202109.0275.v1
Online: 16 September 2021 (11:02:38 CEST)
Molecular Dynamics (MD) simulations model the motion of molecules in atomistic detail and aid in drug design. While simulations of large systems may require several days to complete, analysis of the terabytes of data generated in the process can also be time consuming. Recent studies have captured exciting and dramatic drug-receptor interactions under cell-like complex conditions. Such advances make simulations of biomolecular interactions more realistic, insightful, and informative, and have the potential to make drug design more realistic. However, currently available resources and techniques do not provide, in reasonable time, a comprehensive understanding of the events seen in simulations. We demonstrate that a big data approach results in significant speedups and provides rapid insights into the simulations performed. Building on this improvement, we propose a scalable, self-tuning, and responsive framework based on Cloud infrastructure to accomplish the best possible MD studies with given priorities and within available resources.
ARTICLE | doi:10.20944/preprints202212.0076.v1
Subject: Materials Science, Nanotechnology Keywords: mechanical alloying; titanium carbide; spark plasma sintering; cermets; corrosion
Online: 5 December 2022 (11:30:50 CET)
In order to produce nanostructured Ti0.9Cr0.1C powders, an elemental powder mixture of titanium, chromium, and graphite was milled in a high-energy ball mill for various milling times. Microstructural characteristics such as crystallite size, microstrain, lattice parameter, and dislocation density were determined using X-ray diffraction (XRD). Mechanical alloying successfully produced nanocrystalline (Ti,Cr)C with an average crystallite size of 11 nm, which was directly verified by transmission electron microscopy (TEM). Scanning electron microscopy (SEM) was used to investigate the morphology of the samples. The novelty of this work lies in advancing the scientific understanding of the effect of milling time on particle size distribution and crystalline structure, and of the effect of spark plasma sintering (SPS) on the properties of the bulk samples. Densified cermet samples were produced from the nanocrystalline powders milled for 5, 10 and 20 h by SPS at 1800 °C for 5 min under a pressure of 80 MPa. Phase changes of the produced cermets were examined by XRD and SEM/EDX analyses. Significant amounts of Cr and Fe were detected, especially in the 20 h milled cermet. The bulk forms of the powders milled for 5 and 20 h had relative densities of 98.43 and 98.51%, respectively. However, the 5 h milled cermet had a hardness of 93.3 HRA owing to the more homogeneous distribution of the (Ti,Cr)C phase, its low iron content and high relative density. According to the corrosion rate of 0.0011 mm/year and the charge transfer resistance of 371.68 kΩ·cm2 obtained from the potentiodynamic polarization and EIS tests, the 20 h cermet was the specimen with the highest corrosion resistance.
Keywords: ODS steel; mechanical alloying; spark plasma sintering; zirconium; co-precipitation
Online: 17 February 2021 (10:10:06 CET)
Currently, one of the biggest issues when developing an ODS alloy is the competition established between the different oxide precursors during the precipitation of oxides, whose nature depends on their chemical composition. In the presence of various precursors, usually the one with the highest affinity for oxygen suppresses the formation of the other oxides. In this work, a new process to equilibrate the local concentration of species and to decrease the competition among them is described. A single compound, containing the diverse oxide precursors as one complex oxide, is introduced into a prealloyed 14Cr steel powder via mechanical alloying. This generates environments enriched in Y, Ti and Zr which, after consolidation, refine the oxide precipitation and improve the thermal stability of the alloy. Spark plasma sintering (SPS) was used as the consolidation technique to guarantee shorter sintering times and to maintain the nanostructure obtained. Mechanical properties were assessed by tensile tests and Vickers microhardness.
ARTICLE | doi:10.20944/preprints202007.0450.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: Apache Spark; distributed computing; distributed matrix algebra; deep learning; matrix primitives
Online: 19 July 2020 (21:22:01 CEST)
The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage, simplifying the distributed training workflow. In Spark, a task in a stage does not depend on any other task in the same stage and hence can be scheduled independently. However, several algorithms require more sophisticated inter-task communication, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK’s new auto-vectorization and Spark’s barrier execution mode, we can add non-map/reduce-based algorithms, such as Cannon’s distributed matrix multiplication, to Spark. We document an efficient distributed matrix multiplication using Cannon’s algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein yields up to a 24% performance increase on a 10,000x10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and implementation of deep convolutional neural network based workloads, and thus such efficient algorithms can play a ground-breaking role in faster, more efficient execution of even the most complicated machine learning tasks.
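To make the communication pattern concrete, here is a serial pure-Python simulation of Cannon's algorithm on a q x q logical grid (an illustrative sketch, not the paper's JVM implementation; in barrier mode each grid cell would be one task exchanging blocks with its neighbours over asynchronous network IO instead of a shuffle):

```python
def cannon_multiply(A, B, q):
    """Cannon's algorithm on a q x q process grid, simulated serially.

    A and B are q x q grids of equal-size square blocks. Each grid cell
    multiply-accumulates its local pair of blocks, then the A blocks shift
    left and the B blocks shift up; after q rounds every cell holds its
    block of the product C = A * B.
    """
    n = len(A[0][0])  # block dimension
    # Initial alignment: skew row i of A left by i, column j of B up by j.
    a = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0.0] * n for _ in range(n)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):          # local block multiply-accumulate per cell
            for j in range(q):
                for r in range(n):
                    for c in range(n):
                        C[i][j][r][c] += sum(a[i][j][r][k] * b[i][j][k][c]
                                             for k in range(n))
        # Shift A blocks left and B blocks up by one grid position.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

# Example: square a 4x4 matrix split into a 2x2 grid of 2x2 blocks.
M = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
def block(M, i, j):
    return [row[2 * j:2 * j + 2] for row in M[2 * i:2 * i + 2]]
A = [[block(M, i, j) for j in range(2)] for i in range(2)]
C = cannon_multiply(A, A, 2)
got = [[C[i // 2][j // 2][i % 2][j % 2] for j in range(4)] for i in range(4)]
print(got)
```

The point of the algorithm is that each task only ever holds one block of A and one of B at a time, which is where the lower memory footprint relative to a shuffle-based multiply comes from.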
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: RDF; semantic web; basic graph pattern; Distributed SPARQL Query Processing; Spark
Online: 24 June 2019 (09:15:24 CEST)
Resource Description Framework (RDF) is a data representation format of the Semantic Web, and its data volume is growing rapidly. Cloud-based systems provide a rich platform for managing RDF data. However, the distributed environment faces performance challenges, such as network reshuffle and memory overhead, when processing RDF queries that contain multiple join operations. To overcome these challenges, this paper proposes a Spark-based RDF query architecture built on Semantic Connection Sets (SCS). First, this architecture adopts a mechanism of re-partitioning class data based on vertical partitioning, which reduces memory overhead and speeds up data indexing. Second, a method for generating query plans based on semantic connection sets is proposed. In addition, statistics and broadcast-variable optimization strategies are used to reduce shuffling and data-communication costs. The experiments are based on SPARQLGX, a state-of-the-art RDF system on the Spark platform, and two synthetic benchmarks are used to evaluate the queries. The results show that the proposed approach is more efficient in data search than SPARQLGX.
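The vertical-partitioning idea the architecture builds on can be sketched in a few lines: triples are grouped into one two-column table per predicate, so a basic graph pattern touching a single predicate scans only that table (toy data shown; the actual system works on Spark datasets with further class-based re-partitioning on top):

```python
from collections import defaultdict

# Hypothetical toy triple store; real RDF data would use full IRIs.
triples = [
    ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Person"),
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
]

def vertical_partition(triples):
    """Group (s, p, o) triples into one (subject, object) table per
    predicate, so a query on one predicate reads only that table."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

tables = vertical_partition(triples)
# A basic graph pattern like (?x knows ?y) now scans a single small table.
print(sorted(tables))  # → ['knows', 'rdf:type', 'worksAt']
```

A multi-join query then becomes a sequence of joins between these small per-predicate tables, which is where plan ordering and broadcast variables cut the shuffle cost.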
ARTICLE | doi:10.20944/preprints201705.0077.v1
Subject: Engineering, Automotive Engineering Keywords: spark ignition engine; direct injection; gasoline; butanol; optical investigations; nanoparticle emissions
Online: 9 May 2017 (04:47:26 CEST)
Within the context of the ever wider expansion of direct injection in spark ignition engines, this investigation aimed at an improved understanding of the correlation between fuel injection strategy and nanoparticle emissions. Measurements performed on a wall-guided engine allowed the identification of the mechanisms involved in the formation of carbonaceous structures during combustion and their evolution in the exhaust line. In-cylinder pressure was recorded in combination with cycle-resolved flame imaging, gaseous emissions and particle size distribution. This complete characterization was performed at three injection phasing settings, with butanol and commercial gasoline. Optical access from below the combustion chamber allowed visualization of diffusive flames induced by fuel deposits; these localized phenomena were correlated to observed changes in engine performance and pollutant species. With gasoline fueling, minor modifications of combustion parameters were observed when varying the start of injection. The alcohol, on the other hand, showed marked sensitivity to the fuel delivery strategy. Even though the start of injection was varied in a relatively narrow crank angle range during the intake stroke, significant differences were recorded, especially in particle emissions. This was correlated to fuel jet-wall interactions; the analysis of the diffusive flames, their location and size confirmed the importance of liquid film formation in direct injection engines, especially at medium and high load.
ARTICLE | doi:10.20944/preprints202111.0200.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Diabetes; Diagnosis; Machine Learning; Wireless Body Area Networks; Apache Spark; Feature Selection
Online: 10 November 2021 (09:00:39 CET)
Disease-related data and information collected by physicians, patients, and researchers seem insignificant at first glance. Still, this unorganized data contains valuable information that is often hidden. The task of data mining techniques is to extract patterns to classify the data accurately, and data mining methods have often been used to diagnose various diseases. In this study, a machine learning (ML) technique based on distributed computing in the Apache Spark environment is used to diagnose diabetes by detecting hidden patterns of the illness in a large dataset in real time. Implementation results of three ML techniques, Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM), in the Apache Spark computing environment using the Scala programming language and WEKA show that RF is more efficient and faster at diagnosing diabetes in big data.
ARTICLE | doi:10.20944/preprints201811.0339.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: eHealth; big data; deep learning; watson; spark; decision support system; prevention pathways
Online: 15 November 2018 (04:14:36 CET)
Data collection and analysis are becoming more and more important in a variety of application domains as novel technologies advance. At the same time, we are experiencing a growing need for human-machine interaction with expert systems, pushing research towards new knowledge representation models and interaction paradigms. In particular, in recent years eHealth - which encompasses all health-care practices supported by electronic processing and remote communication - calls for the availability of smart environments and big computational resources. The aim of this paper is to introduce the HOLMeS (Health On-Line Medical Suggestions) framework. The proposed system changes the eHealth paradigm: a trained machine learning algorithm, deployed on a cluster-computing environment, provides medical suggestions via both chat-bot and web-app modules. The chat-bot, based on deep learning approaches, is able to overcome the limitations of biased interaction between users and software, exhibiting human-like behavior. Results demonstrate the effectiveness of the machine learning algorithms, showing 74.65% Area Under the ROC Curve (AUC) when first-level features are used to assess the occurrence of different prevention pathways. When disease-specific features are added, HOLMeS shows 86.78% AUC, achieving a more specific prevention pathway evaluation.
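The AUC figures quoted can be read as a ranking probability: the chance that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch of the metric via the rank-sum identity (toy scores shown, not HOLMeS outputs):

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the fraction of positive/negative pairs where the positive is scored
    higher, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: a classifier ranking all positives above all negatives has AUC 1.0.
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```

An AUC of 74.65% thus means that about three out of four such pairs are ranked correctly by the first-level model.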
ARTICLE | doi:10.20944/preprints201704.0089.v1
Subject: Materials Science, General Materials Science Keywords: ZnO; ceramic nanopowders; Segmented Flow Tubular Reactor (SFTR); Spark Plasma Sintering (SPS)
Online: 14 April 2017 (12:11:50 CEST)
Nanopowders are continuously under investigation as they open new perspectives in numerous fields. There are two main challenges to stimulating their development: sufficiently low-cost, high-throughput synthesis methods leading to production with well-defined and reproducible properties, and, for ceramics, conservation of their nanostructure after sintering. In this context, this paper presents the synthesis of a pure nanosized ZnO powder (dv50 ~ 60 nm, easily redispersible) using a continuous Segmented Flow Tubular Reactor (SFTR), which has previously shown its versatility and robustness, ensuring high powder quality and reproducibility over time. A higher scale of production can be achieved based on a “scale-out” concept by replicating the tubular reactors. The sinterability of the ZnO nanopowders synthesized by the SFTR was studied by natural sintering at 900 °C and 1100 °C, and by Spark Plasma Sintering (SPS) at 900 °C. The performance of the synthesized nanopowder was compared to a high-quality commercial ZnO nanopowder. The samples obtained from the synthesized nanopowder could not be densified at low temperature by traditional sintering, whereas SPS led to a fully dense material after only 5 minutes at 900 °C, while limiting grain growth and thus leading to a nanostructured material.
ARTICLE | doi:10.20944/preprints202301.0501.v1
Subject: Materials Science, Metallurgy Keywords: titanium alloy; ultrafine-grained microstructure; equal channel angular pressing; spark plasma sintering; diffusion welding; corrosion; hot salt corrosion; diffusion; grain boundary.
Online: 27 January 2023 (10:29:04 CET)
Diffusion welding of coarse-grained and ultrafine-grained (UFG) specimens of the near-α titanium alloy Ti-5Al-2V used in nuclear power engineering was performed by Spark Plasma Sintering. The failure of the welded specimens under hot salt corrosion and electrochemical corrosion was shown to have a preferentially intercrystalline character. In the presence of macrodefects, crevice corrosion of the welded joints was observed. The resistance of the alloys against intercrystalline corrosion was found to be determined by the concentration of vanadium at the titanium grain boundaries, by the size and volume fraction of the β-phase particles, and by the presence of micro- and macropores in the welded joints. The welded joints of the UFG alloy have higher hardness, hot salt corrosion resistance, and electrochemical corrosion resistance.
REVIEW | doi:10.20944/preprints202211.0161.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: High Performance Computing (HPC); big data; High Performance Data Analytics (HPDA); convergence; data locality; spark; Hadoop; design patterns; process mapping; in-situ data analysis
Online: 9 November 2022 (01:38:34 CET)
Big data has revolutionised science and technology, leading to the transformation of our societies. High Performance Computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence methods. Traditionally, HPC and big data have focused on different problem domains and have grown into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into HPC and big data converged architectures. Designing HPC and big data converged systems is a hard task requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of High Performance Data Analytics (HPDA) system design, as moving even a byte costs heavily in both time and energy as the size of the system increases. Performance in terms of time and energy is the most important factor for users, particularly energy, due to its being the major hurdle in high performance system design and the increasing focus on green energy systems for environmental sustainability. Data locality is a broad term that encapsulates different aspects, including bringing computations to data, minimizing data movement through efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the state of the art on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions.
Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
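The "bringing computations to data" aspect of data locality can be sketched as a toy scheduler that prefers nodes already holding a task's input block (the task, block, and node names here are hypothetical; real HPC and big data schedulers weigh many more constraints):

```python
def schedule(tasks, block_locations, nodes):
    """Locality-aware placement: run each task on a node that already holds
    its input block when possible, falling back to the least-loaded node,
    so data movement across the interconnect is minimized."""
    load = {n: 0 for n in nodes}
    placement = {}
    for task, block in tasks.items():
        local = [n for n in block_locations.get(block, []) if n in load]
        node = min(local or nodes, key=lambda n: load[n])
        placement[task] = node
        load[node] += 1
    return placement

# Hypothetical example: two data blocks replicated on two of three nodes.
tasks = {"t1": "b1", "t2": "b2", "t3": "b1"}
block_locations = {"b1": ["n1"], "b2": ["n2"]}
placement = schedule(tasks, block_locations, ["n1", "n2", "n3"])
print(placement)  # t1 and t3 land on n1 (data-local), t2 on n2
```

The same preference underlies both Hadoop-style "move the code to the data" scheduling and locality-aware process mapping in HPC, which is why it is a natural anchor point for a converged architecture.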