Lab 6: Running TensorFlow on the VKE Container Service

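Lab Overview

This lab runs on VKE, the Volcano Engine container service, and depends on two other products. Monitoring uses the managed Prometheus service (VMP), so a VMP workspace must be created beforehand. The dataset is stored in TOS object storage (later labs may switch to vePFS), so a TOS bucket must also be created in advance.

The example trains a neural network model that classifies images of clothing such as sneakers and shirts. The lab shows how to run TensorFlow on VKE and how to view GPU monitoring data.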

Task 1: Configure TOS Object Storage

Configure TOS object storage.

picture.image

Task 2: Add a GPU Node

Create a node pool in the VKE cluster.

Enter a node pool name, for example "tf-nodepool-zhangsan2022"
For the instance type, choose the GPU compute type: ecs.g1te.2xlarge

picture.image

Enter a password for the root user and keep the defaults for everything else.

picture.image

Install and deploy the GPU components. If they are already deployed, skip this step.

picture.image

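Log in to the GPU node and confirm that the GPU itself is healthy by running nvidia-smi to view the GPU status.

Tips:
Please submit a screenshot of the result of this step.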

picture.image
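The nvidia-smi check above can also be scripted. The following is a minimal sketch, assuming Python 3.7+ is available on the node. It uses the `--query-gpu` and `--format=csv,noheader` options of nvidia-smi; the function names and the sample line are illustrative and are not part of the lab materials.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "name,memory.total,driver_version"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not match the three requested fields
        gpus.append({
            "name": parts[0],
            "memory_total": parts[1],
            "driver_version": parts[2],
        })
    return gpus


def query_gpus():
    """Run nvidia-smi on the node and return the parsed GPU list."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # Hypothetical sample line, used only to show the parsing; not real lab output.
    sample = "Tesla T4, 15360 MiB, 470.57.02\n"
    print(parse_gpu_csv(sample))
```

On a real GPU node, calling `query_gpus()` should return one entry per GPU; an empty list or a raised `CalledProcessError` suggests the driver or GPU components need attention.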

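Task 3: Check Prometheus Monitoring

If Prometheus monitoring has not been configured, see the lab on monitoring VKE clusters with VMP.

Log in to the container service console, select O&M Management, then select Prometheus Monitoring. Confirm that Prometheus monitoring is running normally.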

picture.image

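Task 4: Prepare the TensorFlow Dataset

Download the following four archives from https://github.com/zalandoresearch/fashion-mnist. If GitHub is slow to reach, the same archives have also been uploaded to Volcano Engine TOS in advance and can be downloaded from there.
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz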

picture.image

The dataset contains 70,000 grayscale images in 10 categories. Each image shows a single article of clothing at low resolution (28x28 pixels), as shown below:

picture.image
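Each archive is a gzip-compressed IDX file: a big-endian header followed by raw unsigned bytes. The sketch below reads only the header, which is a quick way to confirm that a download is intact. Image files carry a 16-byte header and label files an 8-byte header. The helper names are illustrative, and the demo uses a small in-memory file rather than the real dataset.

```python
import gzip
import io
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def read_idx_header(fileobj):
    """Return (magic, dims) for a gzip-compressed IDX file or file object."""
    with gzip.open(fileobj, "rb") as f:
        (magic,) = struct.unpack(">I", f.read(4))
        ndim = magic & 0xFF  # the low byte of the magic number is the dimension count
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims


def make_fake_images(count):
    """Build an in-memory archive shaped like an image file (illustration only)."""
    header = struct.pack(">IIII", IMAGE_MAGIC, count, 28, 28)
    payload = bytes(count * 28 * 28)  # all-zero pixels
    return io.BytesIO(gzip.compress(header + payload))


if __name__ == "__main__":
    magic, dims = read_idx_header(make_fake_images(2))
    print(magic, dims)
```

Passing a real path such as `train-images-idx3-ubyte.gz` instead of the in-memory object should report the image count followed by the 28 and 28 pixel dimensions.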

In the TOS bucket you created, create a directory named TensorFlow, and inside it create two subdirectories named img and data.

picture.image

Upload the four archives downloaded in the previous step to the data directory.

Tips:
Please submit a screenshot of the result of this step.

picture.image
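The Job in Task 5 mounts the TensorFlow directory at /home, so the training script expects the archives under /home/data. The following sketch, with an illustrative function name, reports which of the four archives are missing from a given root directory. It can be run against a local copy of the layout before uploading, or inside the container against /home.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def missing_dataset_files(root):
    """Return the expected archive names that are absent from <root>/data."""
    data_dir = Path(root) / "data"
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Demo on a temporary directory that holds only two of the four archives.
    with tempfile.TemporaryDirectory() as root:
        data_dir = Path(root) / "data"
        data_dir.mkdir()
        for name in EXPECTED_FILES[:2]:
            (data_dir / name).write_bytes(b"")
        print(missing_dataset_files(root))
```

An empty list means all four archives are in place. Directory names are case-sensitive, so `data` and `img` must be spelled exactly as the script's paths expect.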

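Get the TensorFlow ML sample code below and upload it to the TensorFlow directory in TOS. Save it as basicClass.py, the file name that the Job's start command in Task 5 runs.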

import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import gzip
from tensorflow.python.keras.utils import get_file
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

print(tf.__version__)

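# The stock Keras loader below downloads Fashion-MNIST from the internet.
# It is left commented out because this lab reads the archives from the
# mounted TOS volume instead.
#fashion_mnist = keras.datasets.fashion_mnist
#(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

def load_data():
    # /home is where the Job mounts the TOS volume, so the four archives
    # uploaded to the data directory appear under /home/data.
    base = 'file:////home/data/'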
    files = [
        'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
        't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
    ]

    paths = []
    for fname in files:
        paths.append(get_file(fname, origin=base + fname))

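    # Label archives begin with an 8-byte IDX header (magic number, item count)
    with gzip.open(paths[0], 'rb') as lbpath:
        y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)

    # Image archives begin with a 16-byte IDX header (magic number, count, rows, columns)
    with gzip.open(paths[1], 'rb') as imgpath:
        x_train = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)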

    with gzip.open(paths[2], 'rb') as lbpath:
        y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)

    with gzip.open(paths[3], 'rb') as imgpath:
        x_test = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)

    return (x_train, y_train), (x_test, y_test)

(train_images, train_labels), (test_images, test_labels) = load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.savefig('/home/img/basicimg1.png')

train_images = train_images / 255.0

test_images = test_images / 255.0

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.savefig('/home/img/basicimg2.png')

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

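# tf.train.AdamOptimizer is the TensorFlow 1.x API, which matches the
# 1.15.5 image used to run this script in Task 5.
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])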

model.fit(train_images, train_labels, epochs=5)

test_loss, test_acc = model.evaluate(test_images, test_labels)

print('Test accuracy:', test_acc)

predictions = model.predict(test_images)

def plot_image(i, predictions_array, true_label, img):
    predictions_array, true_label, img = predictions_array[i], true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)

    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'

    plt.xlabel('{} {:2.0f}% ({})'.format(class_names[predicted_label],
                                         100*np.max(predictions_array),
                                         class_names[true_label]),
               color=color)

def plot_value_array(i, predictions_array, true_label):
    predictions_array, true_label = predictions_array[i], true_label[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color='#777777')
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)

    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')

i = 0
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(i, predictions, test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(i, predictions,  test_labels)
plt.savefig('/home/img/basicimg3.png')

i = 12
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(i, predictions, test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(i, predictions,  test_labels)
plt.savefig('/home/img/basicimg4.png')

# Plot the first X test images, their predicted label, and the true label
# Color correct predictions in blue, incorrect predictions in red
num_rows = 5
num_cols = 3
num_images = num_rows*num_cols
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for i in range(num_images):
    plt.subplot(num_rows, 2*num_cols, 2*i+1)
    plot_image(i, predictions, test_labels, test_images)
    plt.subplot(num_rows, 2*num_cols, 2*i+2)
    plot_value_array(i, predictions, test_labels)
plt.savefig('/home/img/basicimg5.png')

picture.image
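The last layer of the model applies softmax over the 10 classes, and plot_image picks the predicted label with np.argmax. The sketch below shows that step in plain Python, without TensorFlow or NumPy. The function names and the input scores are illustrative and are not values produced by the lab's model.

```python
import math

CLASS_NAMES = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(probabilities):
    """Return (class name, probability) for the most likely class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best], probabilities[best]


if __name__ == "__main__":
    # Hypothetical scores in which the last class (Ankle boot) dominates.
    scores = [0.0] * 9 + [5.0]
    name, prob = predict_label(softmax(scores))
    print('{} {:2.0f}%'.format(name, 100 * prob))
```

The printed line has the same shape as the x-axis label that plot_image writes under each test image.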

Task 5: Run TensorFlow

Task 5.1 Basic Configuration

All resources in this example use the default namespace.

Create an access key for TOS.

picture.image

Reveal the Secret Access Key, then copy the Access Key ID and the Secret Access Key for use in the next step.

picture.image

Log in to the container service console and create a persistent volume backed by TOS.

picture.image

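Configure the following:

Volume name, for example "tos-pv-zhangsan2022"
Storage type: "Object Storage" (selecting it installs the csi-tos plugin automatically; wait a moment for it to finish)
Click Create Key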

picture.image

In the dialog that opens, enter a key name and the key pair generated in the previous step.

picture.image

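Configure the following:

Select the access key you just created
Select the bucket created earlier
For the subdirectory, enter TensorFlow (if the directory you created in TOS has a different name, change this value to match)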

picture.image

Create a persistent volume claim.

picture.image

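Configure the following:

Persistent volume claim name, for example "tos-pvc-zhangsan2022"
For the storage type, select Object Storage
For the persistent volume, select the volume created in the previous step
Click Create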

picture.image

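Create the Job that runs TensorFlow. You can create it either from YAML or through the console.

Task 5.2 Create Through the Console

Select Workloads, select Jobs, then click Create Job.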

picture.image

Enter a Job name, for example "tf-job-zhangsan2022".

picture.image

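Configure the following:

Image address: cr-demo-cn-beijing.cr.volces.com/tensorflow/tensorflow
Image version: 1.15.5-gpu-vke
CPU request: 2 Core
Memory request: 4 GiB (note that the default unit is MiB)
GPU compute: 1 Card
Check "Enable nvidia scheduling"
Keep the defaults for everything else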

picture.image

Configure storage as follows.

picture.image

Set the lifecycle.

picture.image

Set the start command and run arguments.

-c  
time0=$(date "+%s");while((($(date "+%s")-time0)<=240));do python /home/basicClass.py ;done

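Finish creating the Job, wait for it to complete, then view the Pod logs. (The Pod created by the Job is expected to run for slightly more than 4 minutes.)

Tips:
Please submit a screenshot of the result of this step.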

picture.image
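The start command in this task is a Bash loop that reruns the training script for as long as no more than 240 seconds have elapsed. The elapsed time is checked only between runs, so the last run can begin just before the limit and finish after it, which is why the Pod runs for slightly more than 4 minutes. The sketch below shows the same control flow in Python. The names `run_until` and `FakeClock` are illustrative, and the fake clock exists only so the example finishes instantly.

```python
import time


def run_until(budget_seconds, task, clock=time.time):
    """Run task repeatedly while the elapsed time is within the budget.

    The budget is checked only before each run, as in the Bash loop.
    Returns the number of completed runs.
    """
    start = clock()
    runs = 0
    while clock() - start <= budget_seconds:
        task()
        runs += 1
    return runs


class FakeClock:
    """A manually advanced clock, used here instead of real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


if __name__ == "__main__":
    clock = FakeClock()
    # Pretend each training run takes 100 seconds (an assumed figure).
    runs = run_until(240, lambda: clock.advance(100), clock=clock)
    print(runs, clock.now)  # prints: 3 300.0
```

With the assumed 100-second runs, the third run starts at 200 seconds and ends at 300, overshooting the 240-second budget by one minute.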

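Task 5.3 Create from YAML

To create the Job from YAML, use the manifest below.

The PVC name in the manifest must match the persistent volume claim created in the earlier step, so change it to your actual PVC name.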

apiVersion: batch/v1
kind: Job  
metadata:  
  name: tfjob  
  namespace: default  
spec:  
  activeDeadlineSeconds: 300  
  backoffLimit: 6  
  completions: 1  
  parallelism: 1  
  template:  
    spec:  
      containers:  
      - args:  
        - -c  
        - time0=$(date "+%s");while((($(date "+%s")-time0)<=240));do python /home/basicClass.py ;done  
        command:  
        - /bin/bash  
        image: cr-demo-cn-beijing.cr.volces.com/tensorflow/tensorflow:1.15.5-gpu-vke  
        imagePullPolicy: IfNotPresent  
        name: tf  
        resources:  
          limits:  
            nvidia.com/gpu: "1"  
          requests:  
            cpu: "2"  
            memory: 4Gi  
        terminationMessagePath: /dev/termination-log  
        terminationMessagePolicy: File  
        volumeMounts:  
        - mountPath: /home  
          name: tf  
      dnsPolicy: ClusterFirst  
      restartPolicy: Never  
      schedulerName: default-scheduler  
      securityContext: {}  
      terminationGracePeriodSeconds: 30  
      volumes:  
      - name: tf  
        persistentVolumeClaim:  
          claimName: tos-pvc-zhangsan2022 # change to your actual PVC name
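
Because the PVC name differs for each participant, it can help to generate the manifest instead of editing it by hand. The following is a minimal sketch that builds the same Job as a Python dictionary and prints it as JSON. `build_job` is an illustrative helper, and the fields that Kubernetes fills in by default (dnsPolicy, schedulerName, and so on) are omitted.

```python
import json

# The same Bash loop used as the container argument in the YAML manifest.
START_COMMAND = (
    'time0=$(date "+%s");'
    'while((($(date "+%s")-time0)<=240));'
    'do python /home/basicClass.py ;done'
)

IMAGE = "cr-demo-cn-beijing.cr.volces.com/tensorflow/tensorflow:1.15.5-gpu-vke"


def build_job(pvc_name, job_name="tfjob"):
    """Return the TensorFlow Job manifest with the given PVC name."""
    container = {
        "name": "tf",
        "image": IMAGE,
        "imagePullPolicy": "IfNotPresent",
        "command": ["/bin/bash"],
        "args": ["-c", START_COMMAND],
        "resources": {
            "limits": {"nvidia.com/gpu": "1"},
            "requests": {"cpu": "2", "memory": "4Gi"},
        },
        "volumeMounts": [{"mountPath": "/home", "name": "tf"}],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name, "namespace": "default"},
        "spec": {
            "activeDeadlineSeconds": 300,
            "backoffLimit": 6,
            "completions": 1,
            "parallelism": 1,
            "template": {
                "spec": {
                    "containers": [container],
                    "restartPolicy": "Never",
                    "volumes": [{
                        "name": "tf",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_job("tos-pvc-zhangsan2022"), indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the output can be saved to a file and applied with `kubectl apply -f`. Note that activeDeadlineSeconds is 300 while the loop keeps starting runs for up to 240 seconds, so if a single run of the script took longer than about a minute, the Job could be terminated for exceeding its deadline.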

View the GPU monitoring data. Be sure to query by the name of the Pod created by the Job.

picture.image

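Log in to the TOS object storage console and check the training output in the img directory. The results look like this:

Tips:
Please submit a screenshot of the result of this step.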

picture.image

Task 6: Submit the Lab Results

Open the lab result submission form and submit your result screenshots.
Congratulations on completing the lab!