We apologize if you receive multiple copies of this call for papers.
--------------------------------------------------------------------------------
10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids http://www.csm.ornl.gov/srt/conferences/Resilience/2017
in conjunction with
the 23rd International European Conference on Parallel and Distributed Computing (Euro-Par), Santiago de Compostela, Spain, August 28 - September 1, 2017 http://europar2017.usc.es
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
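As an illustration of the SECDED behavior described above, the following is a minimal Python sketch of an extended Hamming(8,4) code. It is a toy 4-bit example chosen for brevity; real memory ECC typically protects much wider words, and the function names `encode` and `decode` are hypothetical, introduced only for this sketch. It shows a single bit flip being corrected and a double bit flip being detected but not corrected.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    bits = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    # Data bits live at the non-power-of-two positions 3, 5, 6, 7.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # parity over positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # parity over positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # parity over positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2            # makes the parity of all 8 bits even
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the positions (1..7) that hold a 1.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) % 2  # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treated as a single flip at position `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: double-bit error, detected only.
        return None, "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                # no error
    print(decode(word ^ 0b00100000))   # one flipped bit
    print(decode(word ^ 0b00100100))   # two flipped bits
```

Three or more flips in the same word can be miscorrected by a code like this without any indication, which is one way an undetected error can turn into silent data corruption.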
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2017 Website: http://www.csm.ornl.gov/srt/conferences/Resilience/2017
- Resilience 2017 Submissions: https://easychair.org/conferences/?conf=europar2017workshops
- Euro-Par 2017 website: http://europar2017.usc.es
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 5, 2017
- Workshop author notification: June 16, 2017
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 21, 2017
- Workshop camera-ready papers: October 3, 2017
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Dorian Arnold, University of New Mexico, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Franck Cappello, Argonne National Laboratory and University of Illinois at Urbana-Champaign, USA
- Marc Casas, Barcelona Supercomputing Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Michael Heroux, Sandia National Laboratories, USA
- Saurabh Hukerikar, Oak Ridge National Laboratory, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Kathryn Mohror, Lawrence Livermore National Laboratory, USA
- Christine Morin, INRIA Rennes, France
- Dirk Pflueger, University of Stuttgart, Germany
- Nageswara Rao, Oak Ridge National Laboratory, USA
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Lawrence Livermore National Laboratory, USA
- Keita Teranishi, Sandia National Laboratories, USA
--
Christian Engelmann, Ph.D.
R&D Staff Scientist Computer Science Research Group Computer Science and Mathematics Division Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491 e-Mail: engelmannc@ornl.gov / Home: www.christian-engelmann.info
We apologize if you receive multiple copies of this call for papers. The submission deadline has been extended to May 12.
--------------------------------------------------------------------------------
10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids <http://www.csm.ornl.gov/srt/conferences/Resilience/2017>
in conjunction with
the 23rd International European Conference on Parallel and Distributed Computing (Euro-Par), Santiago de Compostela, Spain, August 28 - September 1, 2017 <http://europar2017.usc.es>
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at <http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2017 Website: <http://www.csm.ornl.gov/srt/conferences/Resilience/2017>
- Resilience 2017 Submissions: <https://easychair.org/conferences/?conf=europar2017workshops>
- Euro-Par 2017 website: <http://europar2017.usc.es>
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 12, 2017
- Workshop author notification: June 16, 2017
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 21, 2017
- Workshop camera-ready papers: October 3, 2017
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Dorian Arnold, University of New Mexico, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Franck Cappello, Argonne National Laboratory and University of Illinois at Urbana-Champaign, USA
- Marc Casas, Barcelona Supercomputing Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Michael Heroux, Sandia National Laboratories, USA
- Saurabh Hukerikar, Oak Ridge National Laboratory, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Kathryn Mohror, Lawrence Livermore National Laboratory, USA
- Christine Morin, INRIA Rennes, France
- Dirk Pflueger, University of Stuttgart, Germany
- Nageswara Rao, Oak Ridge National Laboratory, USA
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Lawrence Livermore National Laboratory, USA
- Keita Teranishi, Sandia National Laboratories, USA
We apologize if you receive multiple copies of this notice.
-----------------------------------------------------------------------------
ScalA’17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
held in conjunction with SC17: The International Conference on High Performance Computing, Networking, Storage and Analysis
in cooperation with ACM SIGHPC
November 13, 2017, Denver, CO, USA
<http://www.csm.ornl.gov/srt/conferences/Scala/2017>
Novel scalable scientific algorithms are needed in order to enable key science applications to exploit the computational power of large-scale systems. This is especially true for the current tier of leading petascale machines and the road to exascale computing as HPC systems continue to scale up in compute node and processor core count. These extreme-scale systems require novel scientific algorithms to hide network and memory latency, have very high computation/communication overlap, have minimal communication, and have no synchronization points.
Scientific algorithms for multi-petaflop and exaflop systems also need to be fault tolerant and fault resilient, since the probability of faults increases with scale. Resilience at the system software and at the algorithmic level is needed as a crosscutting effort. Finally, with the advent of heterogeneous compute nodes that employ standard processors as well as GPGPUs, scientific algorithms need to match these architectures to extract the most performance. This includes different system-specific levels of parallelism as well as co-scheduling of computation. Key science applications require novel mathematical models and system software that address the scalability and resilience challenges of current- and future-generation extreme-scale HPC systems.
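The statement that the probability of faults increases with scale can be made concrete with a small Python sketch. It assumes, purely for illustration, that nodes fail independently and with the same probability during a run; the function name `prob_any_failure` and the per-node probability used below are hypothetical and do not come from measurements of any real system.

```python
def prob_any_failure(p_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes fails.

    Assumes every node fails independently with probability `p_node`
    during the run (a simplifying assumption, not a measured model).
    """
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    p_node = 1e-5  # hypothetical per-node failure probability for one run
    for nodes in (1_000, 10_000, 100_000):
        print(nodes, prob_any_failure(p_node, nodes))
```

Under this simple model the chance that a run sees at least one failure grows steadily with node count even when each node is individually very reliable.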
Submission Guidelines
Authors are invited to submit manuscripts in English structured as technical papers not exceeding 8 letter size (8.5in x 11in) pages including figures, tables, and references using the ACM format for conference proceedings. Submissions not conforming to these guidelines may be returned without review. Reference style files are available at <http://www.acm.org/sigs/publications/proceedings-templates>.
All manuscripts will be reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the workshop attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered. At least one author of an accepted paper must register for and attend the workshop. Authors may contact the workshop program chair for more information. Papers should be submitted electronically at: <https://easychair.org/conferences/?conf=scala17>.
Full papers will be published with the SC'17 workshop proceedings in the ACM Digital Library and IEEE Xplore. Selected papers will be invited for an extended version in a special issue of the Journal of Computational Science (JoCS).
Important Dates
- Full paper submission: August 28, 2017
- Notification of acceptance: September 11, 2017
- Final paper submission (firm): October 9, 2017
- Workshop/conference early registration: TBD
- Workshop: November 13, 2017
Topics of interest include, but are not limited to:
- Novel scientific algorithms that improve performance, scalability, resilience, and power efficiency
- Porting scientific algorithms and applications to many-core and heterogeneous architectures
- Performance and resilience limitations of scientific algorithms and applications at scale
- Crosscutting approaches (system software and applications) in addressing scalability challenges
- Scientific algorithms that can exploit extreme concurrency (e.g. 1 billion for exascale by 2020)
- Naturally fault tolerant, self-healing, or fault oblivious scientific algorithms
- Programming model and system software support for algorithm scalability and resilience
Workshop Chairs
- Vassil Alexandrov, Barcelona Supercomputing Center, Spain
- Al Geist, Oak Ridge National Laboratory, USA
- Jack Dongarra, University of Tennessee, Knoxville, USA
Workshop Program Chair
- Christian Engelmann, Oak Ridge National Laboratory, USA
Program Committee
- Vassil Alexandrov, Barcelona Supercomputing Center, Spain
- Hartwig Anzt, University of Tennessee, Knoxville, USA
- Rick Archibald, Oak Ridge National Laboratory, USA
- Franck Cappello, Argonne National Laboratory and University of Illinois at Urbana-Champaign, USA
- Zizhong Chen, University of California, Riverside, USA
- James Elliott, Sandia National Laboratories, USA
- Nahid Emad, University of Versailles SQ, France
- Christian Engelmann, Oak Ridge National Laboratory, USA
- Wilfried Gansterer, University of Vienna, Austria
- Michael Heroux, Sandia National Laboratories, USA
- Kirk E. Jordan, IBM T.J. Watson Research, USA
- Dieter Kranzlmueller, Ludwig-Maximilians-University Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Piotr Luszczek, University of Tennessee, Knoxville, USA
- Michael Mascagni, Florida State University, USA
- Ron Perrot, University of Oxford, UK
- Yves Robert, ENS Lyon, France
- Stuart Slattery, Oak Ridge National Laboratory, USA
- Keita Teranishi, Sandia National Laboratories, USA
We apologize if you receive multiple copies of this call for papers.
--------------------------------------------------------------------------------
11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids <https://www.csm.ornl.gov/srt/conferences/Resilience/2018>
in conjunction with
the 24th International European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, August 27 - 31, 2018 https://europar2018.org
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (such as due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of the extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and be BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2018 Website: https://www.csm.ornl.gov/srt/conferences/Resilience/2018
- Resilience 2018 Submissions: https://easychair.org/conferences/?conf=europar2018ws
- Euro-Par 2018 website: https://europar2018.org
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 4, 2018
- Workshop author notification: June 15, 2018
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 6, 2018
- Workshop date: August 27-28, 2018
- Workshop camera-ready papers: October 2, 2018
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputer Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
--
Christian Engelmann, Ph.D.
R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc@ornl.gov / Home: http://www.christian-engelmann.info/
We apologize if you receive multiple copies of this call for papers. The workshop paper deadline has been extended to May 11, 2018 (no further extensions).
--------------------------------------------------------------------------------
11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids https://www.csm.ornl.gov/srt/conferences/Resilience/2018
in conjunction with
the 24th International European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, August 27 - 31, 2018 https://europar2018.org
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (such as due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of the extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and be BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2018 Website: https://www.csm.ornl.gov/srt/conferences/Resilience/2018
- Resilience 2018 Submissions: https://easychair.org/conferences/?conf=europar2018ws
- Euro-Par 2018 website: https://europar2018.org
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 11, 2018 (no further extensions)
- Workshop author notification: June 15, 2018
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 6, 2018
- Workshop date: August 27-28, 2018
- Workshop camera-ready papers: October 2, 2018
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputer Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
We apologize if you receive multiple copies of this call for papers. The workshop paper deadline has been extended further to May 15, 2018 (no further extensions).
--------------------------------------------------------------------------------
11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids https://www.csm.ornl.gov/srt/conferences/Resilience/2018
in conjunction with
the 24th International European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, August 27 - 31, 2018 https://europar2018.org
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (such as due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of the extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and be BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2018 Website: https://www.csm.ornl.gov/srt/conferences/Resilience/2018
- Resilience 2018 Submissions: https://easychair.org/conferences/?conf=europar2018ws
- Euro-Par 2018 website: https://europar2018.org
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 15, 2018 (no further extensions)
- Workshop author notification: June 15, 2018
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 6, 2018
- Workshop date: August 27-28, 2018
- Workshop camera-ready papers: October 2, 2018
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputer Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
We apologize if you receive multiple copies of this call for papers. The workshop paper deadline has been extended further to May 27, 2018.
--------------------------------------------------------------------------------
11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids https://www.csm.ornl.gov/srt/conferences/Resilience/2018
in conjunction with
the 24th International European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, August 27 - 31, 2018 https://europar2018.org
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (such as due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of the extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and be BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2018 Website: https://www.csm.ornl.gov/srt/conferences/Resilience/2018
- Resilience 2018 Submissions: https://easychair.org/conferences/?conf=europar2018ws
- Euro-Par 2018 website: https://europar2018.org
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 27, 2018 (extended)
- Workshop author notification: June 25, 2018
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 6, 2018
- Workshop date: August 27-28, 2018
- Workshop camera-ready papers: October 2, 2018
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputer Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
--
Christian Engelmann, Ph.D.
R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc@ornl.gov / Home: http://www.christian-engelmann.info/
We apologize if you receive multiple copies of this call for papers.
--------------------------------------------------------------------------------
12th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids https://www.csm.ornl.gov/srt/conferences/Resilience/2019
in conjunction with
the 25th International European Conference on Parallel and Distributed Computing (Euro-Par), Göttingen, Germany, August 26 - 30, 2019 http://2019.euro-par.org
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), software complexity increases, and architectures become more heterogeneous. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2019 website: https://www.csm.ornl.gov/srt/conferences/Resilience/2019
- Resilience 2019 submissions: https://easychair.org/my/conference.cgi?conf=europar2019workshops
- Euro-Par 2019 website: http://2019.euro-par.org
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures, including heterogeneous architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 10, 2019
- Workshop author notification: June 28, 2019
- Workshop author registration: July 15, 2019
- Workshop paper (for informal workshop proceedings): July 22, 2019
- Workshop date: August 26 or 27, 2019
- Workshop camera-ready papers: TBD (after the conference)
General Co-Chairs:
- Stephen L. Scott, Senior Research Scientist - Systems Research Team, Tennessee Tech University and Oak Ridge National Laboratory, USA, scottsl@ornl.gov
- Chokchai (Box) Leangsuksun, SWEPCO Endowed Associate Professor of Computer Science, Louisiana Tech University, USA, box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges, University of New Mexico, USA, bridges@cs.unm.edu
- Christian Engelmann, Oak Ridge National Laboratory, USA, engelmannc@ornl.gov
Program Committee:
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputer Center, Spain
- Robert Clay, Sandia National Laboratories, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
--
Christian Engelmann, Ph.D.
Senior R&D Staff Scientist Computer Science Research Group Computer Science and Mathematics Division Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc@ornl.gov / Home: www.christian-engelmann.info
We apologize if you receive multiple copies of this call for papers. The submission deadline has been extended to May 24. This is a firm deadline.
--------------------------------------------------------------------------------
12th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids https://www.csm.ornl.gov/srt/conferences/Resilience/2019
in conjunction with
the 25th International European Conference on Parallel and Distributed Computing (Euro-Par), Göttingen, Germany, August 26 - 30, 2019 http://2019.euro-par.org
Overview:
Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), software complexity increases, and architectures become more heterogeneous. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its manifestation as a state change is considered an error (e.g., a bad value or incorrect execution), and the transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and BETWEEN 10 AND 12 PAGES, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Papers with fewer than 10 or more than 12 pages will not be accepted due to publisher guidelines. Submissions should include an abstract, keywords and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.
Important websites:
- Resilience 2019 website: https://www.csm.ornl.gov/srt/conferences/Resilience/2019
- Resilience 2019 submissions: https://easychair.org/my/conference.cgi?conf=europar2019workshops
- Euro-Par 2019 website: http://2019.euro-par.org
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures, including heterogeneous architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: Extended to May 24, 2019 (firm)
- Workshop author notification: June 28, 2019
- Workshop author registration: July 15, 2019
- Workshop paper (for informal workshop proceedings): July 22, 2019
- Workshop date: August 26 or 27, 2019
- Workshop camera-ready papers: TBD (after the conference)
General Co-Chairs:
- Stephen L. Scott, Senior Research Scientist - Systems Research Team, Tennessee Tech University and Oak Ridge National Laboratory, USA, scottsl@ornl.gov
- Chokchai (Box) Leangsuksun, SWEPCO Endowed Associate Professor of Computer Science, Louisiana Tech University, USA, box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges, University of New Mexico, USA, bridges@cs.unm.edu
- Christian Engelmann, Oak Ridge National Laboratory, USA, engelmannc@ornl.gov
Program Committee:
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputer Center, Spain
- Robert Clay, Sandia National Laboratories, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
--
Christian Engelmann, Ph.D.
Senior R&D Staff Scientist Computer Science Research Group Computer Science and Mathematics Division Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc@ornl.gov / Home: www.christian-engelmann.info