Tutorials | EuroMPI 2016 : Edinburgh September

Tutorials at EuroMPI 2016

Machine Learning at Scale
Survival in an MPI World

Machine Learning at Scale - Tom Ashby, IMEC, Belgium, & Tom Vander Aa, Exascience Lab at IMEC, Belgium

This tutorial aims to give the HPC audience insight into the needs and opportunities for using HPC tools for machine learning on large data sets. It first gives an overview of different machine learning methods and the opportunities for applying HPC infrastructure to them. It then examines a concrete implementation of collaborative filtering built on standard HPC libraries, including MPI and GASPI. For this application, we show that an HPC-oriented implementation can beat the state of the art in performance and take better advantage of modern HPC hardware.
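For readers unfamiliar with collaborative filtering, the following is a minimal single-process sketch of one common formulation, matrix factorization trained with stochastic gradient descent. It is an illustration only and not the implementation presented in the tutorial: the function names, the toy ratings, and the hyperparameters are all assumptions, and no MPI or GASPI calls are involved.

```python
import math
import random

def factorize(ratings, n_users, n_items, rank=2, epochs=500,
              lr=0.02, reg=0.02, seed=0):
    """Fit user/item factor vectors to (user, item, rating) triples with SGD.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Small random initial factors; each prediction is a dot product U[u] . V[i].
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def rmse(ratings, U, V):
    """Root-mean-square error over the observed ratings."""
    sq = [(r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2
          for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))

# Toy data (hypothetical): two groups of users with opposite tastes.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

if __name__ == "__main__":
    U0, V0 = factorize(RATINGS, 4, 4, epochs=0)   # untrained baseline
    U, V = factorize(RATINGS, 4, 4)
    print("RMSE before training:", rmse(RATINGS, U0, V0))
    print("RMSE after training: ", rmse(RATINGS, U, V))
```

In a distributed setting, the rating triples and the factor vectors would typically be partitioned across processes, with updated factors exchanged between them; how that exchange is organized is the kind of question the tutorial addresses.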

Survival in an MPI World - George Bosilca, Innovative Computing Laboratory, University of Tennessee

As supercomputers enter an era of massive parallelism, the frequency of failures and the costs incurred to prevent such failures from impacting applications are expected to grow significantly. Unlike more traditional fault management methods, user-level fault tolerance techniques have the potential to avoid a full-scale application restart and therefore lower the cost incurred for each failure.

However, in exchange for this lower cost, these techniques require modifications to the applications as well as communication middleware that can detect failures, notify the application of them, and resume communication afterward. In the context of MPI, the Fault Tolerance Working Group has proposed extending the MPI API to restore communication capabilities while maintaining the extreme level of performance to which MPI users are accustomed. This led to the design of User Level Failure Mitigation (ULFM), a minimal extension of the MPI specification that aims to provide users with the basic building blocks and tools to construct higher-level abstractions and introduce resilience into their applications.

In this tutorial, we will present a holistic approach to fault tolerance by introducing multiple fault management techniques while keeping the focus on ULFM. We will engage participants in implementing a range of common fault-tolerant application patterns. We will then introduce a small linear algebra application and demonstrate, by example, how to transform it into a resilient one.
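As a rough illustration of one such pattern, the sketch below applies periodic checkpointing with rollback to a small Jacobi solver. It is a single-process simulation under stated assumptions, not ULFM code and not the tutorial's material: failures are injected as a hypothetical `SimulatedFailure` exception at chosen iterations, and the checkpoint is simply an in-memory copy of the iterate.

```python
class SimulatedFailure(Exception):
    """Hypothetical stand-in for a failure reported by the runtime."""

def jacobi_step(A, b, x):
    """One Jacobi sweep for the linear system A x = b."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def solve_with_rollback(A, b, iters=60, checkpoint_every=5, fail_at=(12, 33)):
    """Run Jacobi iterations, rolling back to the last checkpoint on failure.

    fail_at lists the iterations at which a failure is injected (assumption
    made for this demo; each injected failure fires only once).
    """
    x = [0.0] * len(b)
    checkpoint = (0, list(x))          # (iteration, saved iterate)
    pending = set(fail_at)
    recoveries = 0
    k = 0
    while k < iters:
        try:
            if k in pending:
                pending.discard(k)
                raise SimulatedFailure(f"state lost at iteration {k}")
            x = jacobi_step(A, b, x)
            k += 1
            if k % checkpoint_every == 0:
                checkpoint = (k, list(x))
        except SimulatedFailure:
            # Recovery: restore the saved iterate and redo the lost iterations.
            k, x = checkpoint[0], list(checkpoint[1])
            recoveries += 1
    return x, recoveries

# Diagonally dominant test system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]

if __name__ == "__main__":
    x, recoveries = solve_with_rollback(A, b)
    print("solution:", x)
    print("recoveries:", recoveries)
```

In an MPI application, the exception handler would instead respond to an error reported by the communication layer, and the checkpoint would have to live somewhere that survives the failed process, such as a partner process or stable storage.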