A Framework for Statistical Analysis of Datasets on Heterogeneous Clusters
Carino, R.L., & Banicescu, I. (2005). A Framework for Statistical Analysis of Datasets on Heterogeneous Clusters. Proceedings of the 2005 IEEE International Conference on Cluster Computing. Burlington, MA: IEEE Computer Society Press. (On CDROM).
This paper proposes a framework for the statistical analysis of multiple related datasets on heterogeneous clusters. The analysis procedure, which is separate from the framework, may have such a limited degree of concurrency that only a small number of processors is needed to execute it. Further, the datasets may vary widely in size, leading to large differences in dataset analysis times. The framework partitions the processors assigned to it by the cluster scheduler into processor groups, with the maximum group size chosen to match the degree of concurrency in the analysis procedure. The framework also employs dynamic loop scheduling to address the load imbalance arising from the variability of the computational loads of the datasets, as well as from unpredictable irregularities in the cluster environment. Results from preliminary tests, in which the framework was used to fit gamma-ray burst datasets with vector functional coefficient autoregressive time series models on 64 processors of a heterogeneous general-purpose Linux cluster, demonstrate the effectiveness of the framework.
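The two mechanisms named in the abstract, processor grouping and dynamic scheduling of datasets, can be illustrated with a small simulation. This is a minimal sketch and not the paper's implementation: the function names `partition_processors` and `self_schedule`, the per-dataset costs, and the per-group speeds are all hypothetical, and the scheduling rule shown is plain self-scheduling (one dataset per request), which is only the simplest member of the dynamic loop scheduling family and may differ from the technique the authors actually used.

```python
import heapq


def partition_processors(ranks, max_group_size):
    """Split a list of processor ranks into consecutive groups.

    Every group has at most max_group_size members; the last group may be
    smaller. max_group_size stands in for the degree of concurrency of the
    analysis procedure (an assumed, illustrative parameter).
    """
    if max_group_size < 1:
        raise ValueError("max_group_size must be at least 1")
    return [ranks[i:i + max_group_size]
            for i in range(0, len(ranks), max_group_size)]


def self_schedule(costs, group_speeds):
    """Simulate self-scheduling of datasets onto processor groups.

    costs[i] is the (hypothetical) work needed to analyze dataset i, and
    group_speeds[g] is the (hypothetical) relative speed of group g.
    Whenever a group becomes idle it takes the next unassigned dataset.
    Returns (makespan, assignment), where assignment[g] lists the dataset
    indices processed by group g.
    """
    # Min-heap of (time at which the group becomes idle, group index).
    idle = [(0.0, g) for g in range(len(group_speeds))]
    heapq.heapify(idle)
    assignment = [[] for _ in group_speeds]
    makespan = 0.0
    for index, cost in enumerate(costs):
        start, g = heapq.heappop(idle)        # earliest idle group
        done = start + cost / group_speeds[g]
        assignment[g].append(index)
        makespan = max(makespan, done)
        heapq.heappush(idle, (done, g))
    return makespan, assignment
```

For example, `partition_processors(list(range(64)), 4)` yields 16 groups of 4 ranks, and passing a list of uneven dataset costs to `self_schedule` shows how a group that finishes a small dataset early picks up further datasets while another group is still busy with a large one.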