Wednesday: Paper Abstracts

Session 6 Track A: Challenges and Extensions

Full paper 12: Optimization of Message Passing Services on POWER8 InfiniBand Clusters
Sameer Kumar, Robert Blackmore, Sameh Sharkawi, Nysal Jan K. A., Amith Mamidala and Chris Ward
Full paper 13: Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All
Richard Graham, Ana Gainaru, Artem Polyakov and Gilad Shainer
Short paper 2: Revisiting RDMA Registration in the Context of Lightweight Multi-kernels
Balazs Gerofi, Masamichi Takagi and Yutaka Ishikawa
Short paper 3: Optimizing One-sided Accumulate Operations in Open MPI
Nathan Hjelm

Session 6 Track B: Parallel Applications using MPI

Full paper 14: Runtime Correctness Analysis of MPI-3 Nonblocking Collectives
Tobias Hilbrich, Matthias Weber, Joachim Protze, Bronis R. de Supinski and Wolfgang E. Nagel
Full paper 15: CAF Events Implementation Using MPI-3 Capabilities
Alessandro Fanfarillo and Jeff Hammond
Short paper 4: Allowing MPI tools builders to forget about Fortran
Soren Rasmussen, Martin Schulz and Kathryn Mohror

Full paper 12: Optimization of Message Passing Services on POWER8 InfiniBand Clusters
Sameer Kumar, Robert Blackmore, Sameh Sharkawi, Nysal Jan K. A., Amith Mamidala and Chris Ward

We present performance enhancements to MPI libraries on POWER8 IB clusters. We explore optimizations in the IBM PAMI libraries. We bypass IB VERBS via low-level calls, resulting in low latencies and high message rates. MPI is enabled by extending both MPICH and Open MPI to call the PAMI libraries. We also explore optimizations for GPU-to-GPU communication with minimal processor involvement on POWER8. We achieve an MPI message rate of 186 MMPS and scalable performance in the QBOX and AMG applications.

Full paper 13: Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All
Richard Graham, Ana Gainaru, Artem Polyakov and Gilad Shainer

The All-to-All collective is a data-intensive algorithm. Three variants of the Bruck algorithm are presented that differ in data layout at intermediate steps of the algorithm. Mellanox's InfiniBand hardware support for data scatter/gather is used to selectively replace CPU-based buffer packing and unpacking. Using this offload capability reduces the eight- and sixteen-byte all-to-all latency on 1024 MPI processes by 12% and 9.9%, respectively, on a Broadwell-based system using ConnectX-4 HCAs.
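
For readers unfamiliar with the Bruck algorithm, the following is a minimal single-process simulation of the classic textbook form, not one of the three variants the paper presents. The function name `bruck_alltoall` and the list-of-lists data layout are assumptions made for illustration only. The comment in the exchange phase marks the buffer packing and unpacking step, which is the kind of CPU work the abstract describes replacing with hardware scatter/gather.

```python
def bruck_alltoall(send):
    """Simulate a classic Bruck all-to-all on p ranks (hypothetical helper).

    send[i][j] is the block that rank i wants delivered to rank j.
    Returns recv, where recv[r][s] is the block rank r received from rank s.
    """
    p = len(send)

    # Phase 1: local rotation. Afterwards, index j on every rank holds
    # the block that still has to travel j ranks "forward".
    buf = [[send[i][(i + j) % p] for j in range(p)] for i in range(p)]

    # Phase 2: about log2(p) exchange rounds. In the round with distance d,
    # each rank forwards every block whose index has bit d set to the
    # rank d positions ahead of it.
    dist = 1
    while dist < p:
        idx = [j for j in range(p) if j & dist]
        # Pack: gather the non-contiguous blocks into one message per rank.
        # (This is the packing step; unpacking happens on receipt below.)
        msgs = [[buf[i][j] for j in idx] for i in range(p)]
        for i in range(p):
            src = (i - dist) % p
            # Unpack: scatter the received blocks back to the same indices.
            for j, block in zip(idx, msgs[src]):
                buf[i][j] = block
        dist *= 2

    # Phase 3: inverse rotation so that index s holds the block from rank s.
    return [[buf[r][(r - s) % p] for s in range(p)] for r in range(p)]


if __name__ == "__main__":
    p = 5
    send = [[(src, dst) for dst in range(p)] for src in range(p)]
    recv = bruck_alltoall(send)
    print(recv[2])
```

Each rank sends one packed message per round, so the number of messages grows logarithmically with the number of ranks, at the cost of forwarding some blocks more than once. That trade-off is why the algorithm is usually applied to small messages such as the eight- and sixteen-byte cases mentioned above.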

Short paper 2: Revisiting RDMA Registration in the Context of Lightweight Multi-kernels
Balazs Gerofi, Masamichi Takagi and Yutaka Ishikawa

This paper introduces various optimizations to eliminate RDMA memory registration cost in lightweight multi-kernel architectures, where HPC-specialized lightweight kernels (LWKs) run side-by-side with Linux on compute nodes. In particular, we propose a safe RDMA pre-registration mechanism combined with lazy memory unmapping. We demonstrate up to two orders of magnitude improvement in RDMA registration latency and up to 15% improvement in MPI_Allreduce() for large message sizes.

Short paper 3: An Evaluation of the One-Sided Performance in Open MPI
Nathan Hjelm

Open MPI provides an implementation of the MPI-3 standard supporting native communication over a range of network interconnects. As of version 2.0.0, Open MPI provides a new implementation of the MPI-3.1 RMA specification. This implementation uses native network RDMA and atomic operations through extensions to the byte transfer layer (BTL) interface. In this work, we present further extensions to the BTL interface to export additional hardware support for accumulate operations.

Full paper 14: Runtime Correctness Analysis of MPI-3 Nonblocking Collectives
Tobias Hilbrich, Matthias Weber, Joachim Protze, Bronis R. de Supinski and Wolfgang E. Nagel

The Message Passing Interface (MPI) includes nonblocking collective operations to allow better overlap between computation and communication. These new operations enable complex data movement between large numbers of processes. Their asynchronous behavior hides and complicates the detection of defects in their use. We highlight a lack of correctness tool support for these operations and extend the MUST runtime MPI correctness tool to alleviate this complexity.

Full paper 15: CAF Events Implementation Using MPI-3 Capabilities
Alessandro Fanfarillo and Jeff Hammond

MPI-3 adds important extensions to MPI-2, such as simplified semantics for the one-sided routines and a new tool interface. These and other features make MPI-3 suitable as the transport layer of PGAS languages like Coarray Fortran. Among the Coarray Fortran 2015 features, one of the most relevant is events. In this paper, we analyze two implementations of events and show how to dynamically select the best one according to the capabilities provided by the MPI implementation.

Short paper 4: Allowing MPI tools builders to forget about Fortran
Soren Rasmussen, Martin Schulz and Kathryn Mohror

C tools builders are forced to deal with all the Fortran and C interoperability issues when using MPI. A C-based tool has to intercept the Fortran MPI routines and marshal arguments between C and Fortran. There is also a subset of Fortran MPI routines that cannot be completed from C. Coping with all these issues is difficult and time-consuming. WMPI is a wrapper generator that addresses them by generating multiple lightweight wrappers.

Last updated: 26 Aug 2016 at 10:25

This is an archived website, preserved and hosted by EPCC at the University of Edinburgh. Please email info [at] epcc [dot] ed [dot] ac [dot] uk for enquiries about this archive.