Create a log metric
Create a Service Level Objective (SLO)
Create an Alert on the Service Level Objective (SLO)
Troubleshooting Workloads on GKE for Site Reliability Engineers
Site Reliability Engineers (SREs) have a broad set of responsibilities, and managing incidents is a critical part of their role. You will learn how to take advantage of the integrated capabilities of Google Cloud's operations suite, which includes logging, monitoring, and rich, out-of-the-box dashboards.
Troubleshooting is an iterative process: SREs form a hypothesis about the potential root cause of an incident, then filter, search, and navigate through large volumes of telemetry data collected from their systems to validate or invalidate it. If the hypothesis is invalid, they form another one and repeat until they can isolate a root cause.
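The hypothesis loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the field names (`severity`, `pod`, `message`), the hypotheses, and the `isolate_root_cause` helper are all hypothetical and do not come from the lab or from any Google Cloud API.

```python
# Hypothetical telemetry; in practice this would come from a logging backend.
LOGS = [
    {"severity": "INFO", "pod": "frontend-1", "message": "request served"},
    {"severity": "ERROR", "pod": "checkout-2", "message": "failed to connect to cart service"},
    {"severity": "ERROR", "pod": "checkout-2", "message": "failed to connect to cart service"},
    {"severity": "WARNING", "pod": "frontend-1", "message": "slow response"},
]

# Each hypothesis pairs a description with a filter over log records.
HYPOTHESES = [
    ("frontend pods are failing",
     lambda e: e["pod"].startswith("frontend") and e["severity"] == "ERROR"),
    ("checkout cannot reach a dependency",
     lambda e: e["pod"].startswith("checkout") and "failed to connect" in e["message"]),
]


def isolate_root_cause(logs, hypotheses, min_evidence=2):
    """Test hypotheses in order; return the first one with enough matching records."""
    for description, matches in hypotheses:
        evidence = [entry for entry in logs if matches(entry)]
        if len(evidence) >= min_evidence:
            return description  # hypothesis validated by the telemetry
        # Otherwise the hypothesis is invalidated; try the next one.
    return None  # no hypothesis was supported by the data


root_cause = isolate_root_cause(LOGS, HYPOTHESES)
print(root_cause)
```

The `min_evidence` threshold is an arbitrary choice for this sketch; a real investigation weighs evidence far less mechanically.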
This lab shows how to navigate that iterative journey efficiently and effectively using Google Cloud's operations tools.
In this lab, you will learn how to:
Navigate resource pages of Google Kubernetes Engine (GKE)
Leverage the GKE dashboard to quickly view operational data
Create logs-based metrics to capture specific issues
Create a Service Level Objective (SLO)
Define an Alert to notify SRE staff of incidents
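To make the last three objectives more concrete, the sketch below shows the arithmetic that typically sits behind them: counting matching log entries (the idea behind a logs-based metric), computing the fraction of good requests against a target (an SLO), and deciding whether to notify someone (an alert). It is plain Python, not the Cloud Monitoring API; the sample records, the 99% target, and every function name are assumptions made for illustration.

```python
# Hypothetical request log: one record per request, not real lab data.
REQUESTS = [{"status": 200}] * 97 + [{"status": 500}] * 3

SLO_TARGET = 0.99  # assumed target: 99% of requests succeed


def count_matching(records, predicate):
    """Logs-based metric idea: count the records that match a filter."""
    return sum(1 for record in records if predicate(record))


def slo_compliance(good, total):
    """Fraction of good events; an empty window is treated as compliant."""
    return 1.0 if total == 0 else good / total


def should_alert(compliance, target):
    """Alert when measured compliance falls below the SLO target."""
    return compliance < target


errors = count_matching(REQUESTS, lambda r: r["status"] >= 500)
total = len(REQUESTS)
compliance = slo_compliance(total - errors, total)

print(f"errors={errors} compliance={compliance:.2%} "
      f"alert={should_alert(compliance, SLO_TARGET)}")
```

With 3 errors in 100 requests, compliance is 97%, which is below the assumed 99% target, so the sketch signals an alert. Production SLO alerts are often defined on how quickly the error budget is being consumed rather than on a simple threshold; the threshold here is a deliberate simplification.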