Autoscaling TensorFlow Model Deployments with TF Serving and Kubernetes

search share 가입 로그인

Autoscaling TensorFlow Model Deployments with TF Serving and Kubernetes

2시간 크레딧 7개


Google Cloud Self-Paced Labs


Serving deep learning models can be especially challenging. The models are often large requiring gigabytes of memory. They are also very compute intensive - a small number of concurrent requests can fully utilize a CPU or GPU. Automatic horizontal scaling is one of the primary strategies used in architecting scalable and reliable model serving infrastructures for deep learning models.

In this lab, you will use TensorFlow Serving and Google Cloud Kubernetes Engine (GKE) to configure a high-performance, autoscalable serving system for TensorFlow models.


In this lab, you will learn how to:

  • Configure a GKE cluster with an autoscaling node pool.

  • Deploy TensorFlow Serving in an autoscalable configuration.

  • Monitor serving performance and resource utilization


To successfully complete the lab you need to have a solid understanding of how to save and load TensorFlow models and a basic familiarity with Kubernetes concepts and architecture. Before proceeding with the lab it is recommended to review the following resources:

Lab scenario

You will use TensorFlow Serving to deploy the ResNet101 model. TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.

TensorFlow Serving can be run in a docker container and deployed and managed by Kubernetes. In the lab, you will deploy TensorFlow Serving as a Kubernetes Deployment on Google Cloud Kubernetes Engine (GKE) and use Kubernetes Horizontal Pod Autoscaler to automatically scale the number of TensorFlow Serving replicas based on observed CPU utilization. You will also use GKE Cluster Autoscaler to automatically resize your GKE cluster's node pool based on the resource demands generated by the TensorFlow Serving Deployment.

Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.

GKE's Cluster Autoscaler automatically resizes the number of nodes in a given node pool, based on the demands of your workloads. You don't need to manually add or remove nodes or over-provision your node pools. Instead, you specify a minimum and maximum size for the node pool, and the rest is automatic.

After configuring the cluster and deploying TensorFlow Serving you will use an open source load testing tool Locust to generate prediction requests against the ResNet101 model and observe how the model deployment automatically scales up and down based on the load.

Summary of the tasks performed during the lab:

  • Create a GKE cluster with autoscaling enabled on a default node pool

  • Deploy the pretrained ResNet101 model using TensorFlow Serving

  • Configure Horizontal Pod Autoscaler

  • Install Locust

  • Load the ResNet101 model

  • Monitor the model deployment

이 실습의 나머지 부분과 기타 사항에 대해 알아보려면 Qwiklabs에 가입하세요.

  • Google Cloud Console에 대한 임시 액세스 권한을 얻습니다.
  • 초급부터 고급 수준까지 200여 개의 실습이 준비되어 있습니다.
  • 자신의 학습 속도에 맞춰 학습할 수 있도록 적은 분량으로 나누어져 있습니다.
이 실습을 시작하려면 가입하세요