April 27-30, 2015
Theme: Resilience At Scale
Over the past five to six years, substantial research, numerous studies, and several workshops have addressed resilience at scale, and significant progress has been made in many related domains. Despite this progress, the exascale resilience problem is not fully understood, and the community still faces the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unreliable systems. System developers and researchers are working to characterize and quantify the types and frequencies of known faults and the methods to mitigate them. However, it is not clear that this knowledge has been effectively passed to application developers, or that they fully understand the broad range of issues that a fault-prone environment will bring to the development and use of their codes.
Developers of high performance computing (HPC) codes face two key questions: What are the reliability requirements for a particular code, and how should it be constructed and executed to meet those requirements? Answers to the first question can largely be determined by analyzing use cases based on the expected workload and workflow. Answering the second question is much more difficult, requiring deep and broad knowledge of the expected fault environment as well as the landscape of technologies and methods available to respond to and manage these faults. Novel numerical methods or more stochastic approaches may be required to meet accuracy requirements in the face of undetectable soft errors. Developers might use standard fault-notification application programming interfaces (APIs) and adaptive system runtimes to respond to node failures. Burst buffers can provide in-memory data replication and automated recovery from crashes. In any case, application developers must be aware of these potential solutions and be able to feed their requirements back to the developers of these products.
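To make the second question concrete, the sketch below shows one way an application might combine periodic in-memory checkpointing with a fault-notification callback. The `ResilientSolver` class, the `on_node_failure` callback, and the in-memory checkpoint are all hypothetical stand-ins for a real fault-notification API and a burst-buffer service; this is a minimal illustration, not a proposed interface.

```python
import copy

class ResilientSolver:
    """Toy iterative solver that checkpoints its state in memory and
    rolls back when notified of a (simulated) node failure."""

    def __init__(self, state, checkpoint_every=10):
        self.state = state
        self.checkpoint_every = checkpoint_every
        self._checkpoint = copy.deepcopy(state)  # stand-in for a burst-buffer copy
        self._step = 0

    def on_node_failure(self):
        # Hypothetical callback a runtime might invoke on node failure:
        # roll back to the last consistent checkpoint, losing only the
        # work done since that checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def run(self, steps, fail_at=None):
        for _ in range(steps):
            self._step += 1
            self.state["x"] += 1.0                 # one unit of "work"
            if fail_at is not None and self._step == fail_at:
                self.state["x"] = float("nan")     # simulated silent corruption
                self.on_node_failure()             # runtime notifies us; we recover
            if self._step % self.checkpoint_every == 0:
                self._checkpoint = copy.deepcopy(self.state)
        return self.state["x"]
```

The design point this illustrates is the basic trade-off the conference is concerned with: checkpointing more often costs time and memory, while checkpointing less often means more lost work per failure.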
The goal of this year’s Salishan conference is to expose the audience to the broad range of resilience issues that will affect the HPC community. Invited talks will focus on recent research in areas that are particularly important to the development of resilient applications on exascale systems. Application developers will learn about the diverse fault environments they can expect to face in the near future. Hardware providers, algorithm developers, and system software and library developers will have ample opportunity to hear directly from application developers about what they expect and can tolerate regarding resiliency at scale. The main conference goal, as always, is to provide forums in which participants can give feedback, discuss issues, explain concerns, develop collaborations, and recommend solutions. This year the conference organizing committee will also seek to foster increased participation by early-career staff, women, and minorities.
Session 1: The Fault Environment
This session is designed to expose the audience to the faults we expect to face as systems move to exascale. What components of the system are expected to fail and at what rate? What is the nature of these failures? What failures can we detect and what is the cost of detection to applications? What are the potential effects of failures that we can’t detect? What does it really mean for a large-scale system to be “reliable”? Is this fault environment expected to improve or worsen in the future and to what extent? What are the expected trade-offs between cost and system reliability? What can we learn from the private sector and their approaches to reliability?
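As a back-of-the-envelope illustration of why component failure rates matter at exascale: under the common simplifying assumption of independent components with exponentially distributed failures, the failure rates add, so the system MTBF is the component MTBF divided by the component count. The node count and per-node MTBF below are illustrative round numbers, not measurements from any specific machine.

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """System MTBF under the serial-failure model: any one component
    failing counts as a system failure, so rates (1/MTBF) sum."""
    return component_mtbf_hours / n_components

# Illustrative numbers: 100,000 nodes, each with a 5-year (~43,800-hour)
# MTBF, give a system-wide failure roughly every 26 minutes.
mtbf_hours = system_mtbf_hours(5 * 8760, 100_000)
print(round(mtbf_hours * 60, 1))  # minutes between system failures
```

Even generous per-component reliability thus yields frequent system-level failures at scale, which is why the session asks whether this environment will improve or worsen and at what cost.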
Session 2: Resilient Numerical Methods
This session will cover potential approaches for handling soft errors, both detectable and undetectable, in machine logic. How can error and uncertainty be quantified? Is redundant computation a reasonable approach? How do stochastic methods compare with deterministic methods in the presence of undetected soft errors? Can compilers help with soft-error mitigation? To what extent will flexibility be sacrificed to maintain known performance standards? What are applications currently doing to mitigate faults, and at what cost?
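One of the approaches the session asks about, redundant computation, can be sketched as majority voting over replicated executions (triple modular redundancy). The function names below are illustrative; the point is the cost model, not an implementation.

```python
from collections import Counter

def vote(results):
    """Majority vote over redundant results. Returns the winning value
    and whether any replica disagreed (i.e., a detected soft error)."""
    counts = Counter(results)
    winner, n = counts.most_common(1)[0]
    return winner, n != len(results)

def tmr(f, *args):
    """Run f three times and vote. This masks a single silent fault in
    one replica at roughly 3x the compute cost -- exactly the kind of
    cost/accuracy trade-off the session is concerned with."""
    return vote([f(*args) for _ in range(3)])

# One replica hit by a simulated bit-flip (46 = 42 with bit 2 flipped):
result, fault_detected = vote([42, 42, 46])
```

Here the vote both masks the corrupted value and flags that a fault occurred; a double fault producing the same wrong value in two replicas would defeat the scheme, which is one reason redundancy alone is not a complete answer.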
Session 3: System Software and APIs
This session will expose the application community to current research in system software that is designed to support resilient applications on large-scale systems. How will system software enable resilience? How transparent will it be to applications? What is the expected cost to applications (for example, data redundancy)? Has progress been made toward development of a standardized fault-handling model? How will system software serve as the interface between applications and the underlying hardware fault environment?
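No standardized fault-handling model yet exists; the sketch below imagines what a minimal callback-registration interface between an application and a system runtime might look like. Every name here (`FaultRuntime`, `FaultKind`, `register`, `notify`) is hypothetical, intended only to make the question of a standardized model concrete.

```python
from enum import Enum, auto

class FaultKind(Enum):
    NODE_FAILURE = auto()
    MEMORY_ERROR = auto()

class FaultRuntime:
    """Hypothetical runtime-side registry: applications subscribe to
    fault classes, and the runtime delivers notifications to handlers."""

    def __init__(self):
        self._handlers = {}

    def register(self, kind, handler):
        self._handlers.setdefault(kind, []).append(handler)

    def notify(self, kind, detail):
        # Deliver the fault to every subscribed handler. A real runtime
        # would also need acknowledgement, escalation, and a default
        # policy for faults no application has claimed.
        for handler in self._handlers.get(kind, []):
            handler(detail)

# An application subscribes only to the faults it can usefully respond to.
runtime = FaultRuntime()
log = []
runtime.register(FaultKind.NODE_FAILURE, lambda rank: log.append(f"restarting rank {rank}"))
runtime.notify(FaultKind.NODE_FAILURE, 17)
runtime.notify(FaultKind.MEMORY_ERROR, 0xDEAD)  # no subscriber: dropped here
```

The open design questions the session raises map directly onto this sketch: which fault classes the interface should expose, what happens to unclaimed faults, and how much of this should remain transparent to the application.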
Session 4: Data Analysis on Uncertain Data
This session examines methods and techniques to deal with potentially corrupt or unreliable data (produced by unreliable machines). Can data analysis tools provide a means to detect unreliable data? To quantify the uncertainty? What additional information can the system supply (for example, detected failures and average rate of undetected failures) to help analysis software predict the reliability of data?
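One simple family of techniques the session might examine is statistical screening: flagging values that are implausible relative to the rest of a series. The sketch below uses a median/median-absolute-deviation (MAD) outlier test; the threshold is an illustrative choice, and a flag is only a suspicion of corruption, never a proof.

```python
import statistics

def flag_suspect(values, threshold=5.0):
    """Flag values whose deviation from the median exceeds `threshold`
    times the median absolute deviation (MAD) -- a rough screen for
    silently corrupted data in an otherwise smooth series."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread: nothing to compare against
    return [abs(v - med) / mad > threshold for v in values]

# A simulated bit-flip in one element of an otherwise smooth series:
data = [1.0, 1.1, 0.9, 1.05, 1e12, 1.0, 0.95]
suspects = flag_suspect(data)
```

Robust statistics like the median and MAD are used here rather than the mean and standard deviation precisely because a single corrupted value can dominate the latter; this is the kind of detector that system-supplied failure-rate information could help calibrate.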
Session 5: Future Application Development Environment
This session deals with how the application development environment might change in the face of a constant-fault environment. Will new languages emerge with features that enable responses to faults? Will large-scale databases re-emerge as data caches for continuous restart? Will we move to a more dynamic computing environment, with codes running concurrently and elastically alongside other codes? At what level do application developers want to deal with failure (for example, full system support versus application notification and response)? How important is portability? What are the most important tools to help application developers deal with large-scale system failures?