Performance Evaluation of MPI Vs. Apache Spark for Condition Based Maintenance Data
Haupt, T., Jelinek, B., Card, A., & Henley, G. (2020). Performance Evaluation of MPI Vs. Apache Spark for Condition Based Maintenance Data. Intelligent Computing. Virtual Conference: Springer. DOI:10.1007/978-3-030-52249-0_3.
This paper presents the results of an exploratory research program comparing the performance of typical data analysis patterns under two approaches: an MPI-based code on a classical HPC Linux cluster with a Lustre parallel file system, and a Hadoop environment over the HDFS parallel file system. The selected analysis patterns reflect the requirements of a condition-based maintenance (CBM) system that must efficiently evaluate daily files from thousands of vehicles. HDF5 files are read from Lustre at a rate similar to that of Parquet files read from HDFS. However, the first results indicate much better performance of an MPI implementation in Python than of the equivalent implementation in the Hadoop environment, which uses SparkR and its built-in functions. This result is surprising but consistent with results reported by other authors. Furthermore, scalability tests of the MPI code indicate good performance of the Lustre file system.