Troubleshooting Workloads on GKE for Site Reliability Engineers

1 hour 30 minutes 5 Credits


Google Cloud Self-Paced Labs


Site Reliability Engineers (SREs) have a broad set of responsibilities, and managing incidents is a critical part of the role. You will learn how to use the integrated capabilities of Google Cloud's operations suite, which includes logging, monitoring, and rich, out-of-the-box dashboards.

Troubleshooting is an iterative process: SREs form a hypothesis about the potential root cause of an incident, then filter, search, and navigate through the large volumes of telemetry data collected from their systems to validate or invalidate it. If the hypothesis proves invalid, they form another one and repeat the process until they isolate a root cause.
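The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the lab or of any Google Cloud API: the `LogEntry` and `Hypothesis` types, the `isolate_root_cause` function, and the service names in the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LogEntry:
    """A simplified, hypothetical telemetry record."""
    severity: str
    resource: str
    message: str


@dataclass
class Hypothesis:
    """A suspected root cause plus a filter that finds supporting evidence."""
    description: str
    matches: Callable[[LogEntry], bool]


def isolate_root_cause(
    entries: List[LogEntry],
    hypotheses: List[Hypothesis],
    min_evidence: int = 1,
) -> Optional[Hypothesis]:
    """Test each hypothesis in turn; return the first one the data supports."""
    for hypothesis in hypotheses:
        # Filter the telemetry down to entries relevant to this hypothesis.
        evidence = [e for e in entries if hypothesis.matches(e)]
        if len(evidence) >= min_evidence:
            return hypothesis  # validated: stop iterating
        # Otherwise the hypothesis is invalidated; move on to the next one.
    return None


# Hypothetical sample data; the service names are made up for illustration.
entries = [
    LogEntry("INFO", "frontend", "GET / 200"),
    LogEntry("ERROR", "recommendationservice", "failed to connect to catalog"),
    LogEntry("ERROR", "recommendationservice", "failed to connect to catalog"),
]

hypotheses = [
    Hypothesis(
        "frontend is returning errors",
        lambda e: e.resource == "frontend" and e.severity == "ERROR",
    ),
    Hypothesis(
        "recommendation service cannot reach catalog",
        lambda e: e.resource == "recommendationservice"
        and "failed to connect" in e.message,
    ),
]

result = isolate_root_cause(entries, hypotheses)
print(result.description if result else "no hypothesis validated")
```

In practice each filter would be a query against real telemetry rather than an in-memory list, but the shape of the iteration is the same: narrow the data, check the evidence, then either stop or try the next hypothesis.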

In this lab, you will learn how to navigate that iterative journey efficiently and effectively using Google Cloud's operations tools!

Lab Objectives

In this lab, you will learn how to:

  1. Navigate resource pages of Google Kubernetes Engine (GKE)

  2. Leverage the GKE dashboard to quickly view operational data

  3. Create logs-based metrics to capture specific issues

  4. Create a Service Level Objective (SLO)

  5. Define an Alert to notify SRE staff of incidents
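To make the SLO objective more concrete, here is a minimal sketch of the arithmetic behind a request-based SLO and its error budget. It assumes the service level indicator is simply the ratio of good requests to total requests; the function name `error_budget_remaining` and the sample numbers are hypothetical and are not taken from the lab.

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget has been used, 0.0 means it is exactly exhausted,
    and a negative value means the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so no budget consumed

    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good               # failures actually observed
    return 1 - actual_bad / allowed_bad


# Hypothetical numbers: a 99.5% target with 10 failures in 10,000 requests.
remaining = error_budget_remaining(good=9_990, total=10_000, slo_target=0.995)
print(f"error budget remaining: {remaining:.0%}")
```

An alert for SRE staff would typically fire when this value drops quickly or falls below a chosen threshold; the threshold itself is a policy decision rather than something this sketch prescribes.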
