As supercomputers continue to grow in scale and complexity, reliability becomes a paramount concern: the mean time between failures, now measured in hours, is expected to drop to minutes on future machines. Fault detection, the process of identifying when and where a fault has occurred, plays a critical role in fault management.
A team from the IIT Department of Computer Science presented a novel approach to this challenging problem in their paper “Exploring Void Search for Fault Detection on Extreme Scale Systems,” which won the Best Paper Award at IEEE Cluster 2014, a well-known conference in the field of high-performance computing, held in September. The team included Zhiling Lan, associate professor of computer science; doctoral students Eduardo Berrocal, Li Yu, and Sean Wallace; and their collaborator Michael E. Papka of the Argonne Leadership Computing Facility. This year, the conference accepted 29 full papers out of 122 valid submissions and awarded a single Best Paper Award.
Recognizing the wide deployment of hardware sensors on modern computers to track environmental conditions (e.g., temperature, voltage, and current), the team demonstrates that using environmental data to detect hardware faults is not only possible but can actually outperform conventional log-based methods. Furthermore, the team applies void search algorithms from the field of astrophysics to fault detection in a completely novel way. Void search has been used primarily in astrophysics for studying galaxy formation and evolution, since the Universe consists of many underdense regions (i.e., voids). Inspired by the similarity between voids and anomalous behavior, the team presents a new void search-based detection method that operates on environmental data. They further evaluate the design using real logs from Mira, the 10-petaflops IBM Blue Gene/Q system at Argonne. Their experiments show that the new design is highly effective and outperforms a number of existing detection algorithms.
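To give a flavor of the underlying intuition, the sketch below is a highly simplified, density-based anomaly detector loosely inspired by the notion of a void as an underdense region. It is not the team's algorithm; the function name, radius, and neighbor threshold are all hypothetical choices for illustration. It flags sensor readings that fall in sparsely populated regions of the value space:

```python
# Illustrative sketch only: a simplified density-based anomaly detector
# loosely inspired by the idea of "voids" (underdense regions). This is
# NOT the paper's algorithm; names and parameters are hypothetical.

def void_anomalies(readings, radius=2.0, min_neighbors=3):
    """Flag readings that sit in underdense regions of the value space.

    readings: list of floats (e.g., per-node temperature samples)
    radius: neighborhood half-width, in the same units as the readings
    min_neighbors: minimum neighbors required to count as "dense"
    Returns the indices of readings flagged as anomalous.
    """
    anomalies = []
    for i, x in enumerate(readings):
        # Count the other readings lying within `radius` of this one.
        neighbors = sum(1 for j, y in enumerate(readings)
                        if j != i and abs(x - y) <= radius)
        # A reading in an underdense region (a "void") is suspicious.
        if neighbors < min_neighbors:
            anomalies.append(i)
    return anomalies

# Hypothetical temperature samples: one node runs unusually hot.
temps = [61.2, 60.8, 61.5, 60.9, 61.1, 78.4, 61.0, 60.7]
print(void_anomalies(temps))  # the outlier at index 5 is flagged
```

The actual method in the paper is far more sophisticated (the void search operates over the multidimensional environmental data of an extreme-scale system), but the core intuition is the same: healthy readings cluster densely, while faulty hardware produces readings that drift into underdense regions.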