GPU Usage

The Container Platform provides access to a worker node with 4 `Nvidia Tesla V100` GPUs. The following libraries and tools are available on the worker node:

  • CUDA 9.0, 9.1, 9.2
  • CUDNN 7.1.4
  • OpenBlas
  • Tensorflow 1.9
  • TensorBoard
  • Keras
  • Theano
  • Caffe
  • Lasagne
  • Jupyter
  • Torch7
  • PyTorch
  • virtualenv
  • docker
  • numpy 1.15
  • scipy 1.1
  • scikit-learn
  • matplotlib
  • pandas
  • Cython 0.28
  • nolearn

Getting a GPU

To obtain a GPU, it is sufficient to request the resource `nvidia.com/gpu` in the Pod deployment. For example, to deploy the DIGITS container, put this into a file named digits.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

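Note that GPUs can only be specified under `limits`: Kubernetes does not allow GPU overcommitment, and `requests`, if given, must equal `limits`. As a sketch, a variant requesting two GPUs would change only the resources section (fragment only, not a complete Pod spec):

```yaml
# Fragment only: replaces the resources section of the Pod spec above.
# GPUs may only appear under limits; requests, if set, must match.
resources:
  limits:
    nvidia.com/gpu: 2  # note: platform policy asks for one GPU per user
```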
Now you can deploy it with:

$ kubectl create -f digits.yaml

GPU state

To get the current status of the GPUs, issue:

$ kubectl exec gpu-pod -- nvidia-smi
Mon Jul 30 06:13:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   24C    P0    35W / 250W |    427MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
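If you want to check memory usage programmatically rather than by eye, the memory column of the table above can be extracted with a short script. This is a minimal sketch, assuming the `nvidia-smi` table layout shown above (the regex is tailored to that format):

```python
import re

def parse_gpu_memory(smi_output):
    """Extract (used_mib, total_mib) pairs from nvidia-smi table output."""
    # Matches fragments like "427MiB / 16160MiB" in the Memory-Usage column.
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    return [(int(used), int(total)) for used, total in pattern.findall(smi_output)]

# Example line taken from the output above.
sample = "| N/A   24C    P0    35W / 250W |    427MiB / 16160MiB |      0%      Default |"
print(parse_gpu_memory(sample))  # [(427, 16160)]
```

In a Pod you could feed it the output of `nvidia-smi` captured with `subprocess.run`.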

Controlling GPU usage

Since GPUs are limited and expensive, we ask you to use them sparingly. In particular, each user should use only one GPU at a time.
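One way to make sure a process stays on a single GPU is to restrict which devices it can see with the standard CUDA_VISIBLE_DEVICES environment variable, set before the framework initializes CUDA. A minimal sketch:

```python
import os

# Expose only GPU 0 to this process. This must be set before
# TensorFlow, PyTorch, etc. initialize CUDA, i.e. before importing them.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Any framework imported after this point will see a single GPU.
```

The same effect can be obtained from the shell, e.g. `CUDA_VISIBLE_DEVICES=0 python train.py`.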

If you are using TensorFlow, avoid allocating all GPU memory by setting this option when creating a session:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

If you use Keras, you must pass it the TensorFlow session using the function:

keras.backend.tensorflow_backend.set_session(session)