CS 672: System Reliability at Scale

Spring 2024

Announcements

Overview

This course exposes students to reliability challenges and practices in modern large-scale systems, including cloud, data center, and supercomputing platforms. The goal is to provide a strong technical background for pursuing research, practical application, or further study in building robust systems.

We will look at how systems fail and recover, broadly touching on reliability and security topics and their implications for the sustainability of large-scale computing. We will explore relevant case studies centered on current challenges in production systems, reviewing both state-of-the-art techniques and recent academic proposals.