GPU Usage¶
The Container Platform provides access to a worker node with 4 NVIDIA Tesla V100 GPUs.
This is a non-exhaustive list of the supported libraries:
CUDA 9.0, 9.1, 9.2
CUDNN 7.1.4
OpenBlas
Tensorflow 1.9
TensorBoard
Keras
Theano
Caffe
Lasagne
Jupyter
Torch7
PyTorch
virtualenv
docker
numpy 1.15
scipy 1.1
scikit-learn
matplotlib
pandas
Cython 0.28
nolearn
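As a quick sanity check, you can print the versions that are actually available inside your container; the snippet below assumes the image you are running ships the libraries listed above:

import numpy
import scipy
import tensorflow as tf

# Print the versions available in the running container.
print("numpy:", numpy.__version__)
print("scipy:", scipy.__version__)
print("tensorflow:", tf.__version__)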
Policy¶
Since the number of GPUs that the Container Platform can currently provide is limited (4 GPUs), and in order to allow more users to access them, the requested resources can be allocated for a maximum of 8 days.
Specifically, it is possible to request and allocate 1 GPU for a maximum of 8 days, 2 GPUs for a maximum of 4 days, and 3-4 GPUs for a maximum of 2 days.
When the allocation period expires (8, 4 or 2 days depending on the number of GPUs requested), the user pod(s) that are using the GPU(s) will be deleted and the GPU resources will be reallocated.
Therefore, we recommend that you back up your most important data before the expiration date, so that it is not lost when the pod(s) are deleted.
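For example, data can be copied out of a running pod to your own machine with kubectl cp; the pod name and path below are only placeholders:

$ kubectl cp gpu-pod:/workspace/results ./results-backup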
Getting a GPU¶
To obtain access to one or more GPUs, please send us a request via the web portal (Common requests -> Reserve GPU).
Each request will be queued and satisfied in chronological order, as long as the requested GPUs are free and can be allocated.
Users will then receive a confirmation email granting access to the GPU(s), along with information about the time period during which the GPU(s) will be exclusively reserved for them.
Once the confirmation email has been received, it is sufficient to request the resource nvidia.com/gpu in the Pod specification and add a tolerations section. The key in the tolerations section can be either ‘vgpu’ or ‘gpu’. For example, to deploy the DIGITS container, put the following into the file digits.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "vgpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: digits-container
    image: nvcr.io/nvidia/digits:19.12-tensorflow-py3
    resources:
      limits:
        nvidia.com/gpu: 1
Now you can deploy it with:
$ kubectl create -f digits.yaml
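You can then check that the pod has been scheduled on the GPU node and is running with:

$ kubectl get pod gpu-pod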
GPU state¶
To get the current status of the GPUs, issue:
$ kubectl exec gpu-pod -- nvidia-smi
Mon Jul 30 06:13:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:82:00.0 Off | 0 |
| N/A 24C P0 35W / 250W | 427MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Controlling GPU usage¶
Since GPUs are limited and expensive, we invite you to use them sparingly. In particular, each user should only use one GPU at a time.
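If more than one GPU is visible inside your pod but your job needs only one, a common way to keep CUDA-based frameworks from touching the other devices is to restrict the visible devices before the framework is imported; the index 0 below is just an example:

import os

# Expose only the first GPU to CUDA-based frameworks; set this before importing them.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf  # TensorFlow will now see a single GPU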
If you are using TensorFlow, avoid allocating all of the GPU memory by setting this option when creating a session:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
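If you prefer a hard cap instead of on-demand growth, TensorFlow 1.x can also be limited to a fixed fraction of the GPU memory; the 0.4 below is just an example value:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # use at most 40% of the GPU memory
sess = tf.Session(config=config)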
If you use Keras, you must pass the TensorFlow session to it using the function:
keras.backend.tensorflow_backend.set_session(session)
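Putting the two pieces together, a minimal sketch for the TensorFlow 1.x and standalone Keras versions listed above could look like this (module paths differ in newer Keras releases):

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Create a session that allocates GPU memory on demand and hand it to Keras.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

# Build and train your Keras model as usual from here on.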