RCE 84: Scalable Checkpoint/Restart

Created on Thursday, 12 December 2013 00:51
Written by Brock Palen

Brock Palen and Jeff Squyres speak with Kathryn Mohror and Adam Moody about Scalable Checkpoint/Restart (SCR), an open-source library for implementing multilevel checkpointing on clustered systems.

Kathryn Mohror is a computer scientist on the Scalability Team (https://scalability.llnl.gov/) at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory (LLNL). Kathryn’s research on high-end computing systems currently focuses on scalable fault-tolerant computing and on performance measurement and analysis. Her other research interests include scalable automated performance analysis and tuning, parallel file systems, and parallel programming paradigms. Kathryn has been working at LLNL since 2010.

Adam Moody works within the Livermore Computing Center at Lawrence Livermore National Laboratory. He supports people using computer center resources, and his focus lies in scalable communication algorithms and fault tolerance. He contributes to a number of open-source software projects for high-performance computing, including MPI, scalable process group representation and communications, parallel sorting, parallel file management, and the Scalable Checkpoint/Restart library. He graduated from The Ohio State University in 2003, and he is and always will be an avid fan of the Buckeyes.