DeepOps

Deploy a scalable DGX cluster on-prem or in the cloud

Overview

The DeepOps project aims to facilitate deployment of multi-node GPU clusters for Deep Learning and HPC environments, in an on-prem, optionally air-gapped datacenter or in the cloud.

This document is written as a step-by-step guide that should allow a person with minimal Linux system administration experience to install and configure an entire cluster from scratch. More experienced administrators can pick and choose the items that are useful; it is not necessary to follow every step in the guide if existing software or infrastructure is to be used.

Installation involves first bootstrapping management server(s) with a Kubernetes installation and persistent volume storage using Ceph. Cluster services for provisioning operating systems, monitoring, and mirroring container and package repos are then deployed on Kubernetes. From there, DGX servers are booted and installed with the DGX base OS, and Kubernetes is extended across the entire cluster to facilitate job management. An optional login server can be used to allow users a place to interact with data locally and launch jobs. The Slurm job scheduler can also be installed in parallel with Kubernetes to facilitate easier large-scale training jobs or more traditional HPC workloads.

For more information on deploying DGX in the datacenter, consult the DGX Data Center Reference Design Whitepaper.

Prerequisites

Hardware Requirements

  • 1 or more CPU-only servers for management
    • 3 or more servers can be used for high-availability
    • Minimum: 4 CPU cores, 16GB RAM, 100GB hard disk
      • More storage required if storing containers in registry, etc.
      • More RAM required if running more services on kubernetes or using one/few servers
    • Ubuntu 16.04 LTS installed
  • 1 or more DGX compute nodes
  • Laptop or workstation for provisioning/deployment
  • (optional) 1 CPU-only server for user job launch, data management, etc.

Software Requirements

The administrator's provisioning system should have the following installed:

  • Ansible 2.5 or later
  • git
  • docker (to build containers)
  • ipmitool
  • python-netaddr (for kubespray)

The management server(s) should be pre-installed with Ubuntu 16.04 LTS before starting the installation steps. If you already have a bare-metal provisioning system, it can be used to install Ubuntu on the management server(s). Integrating the DGX Base OS with other bare-metal provisioning systems is outside the scope of this project.

Network Requirements

The DeepOps service container "DGXie" provides DHCP, DNS, and PXE services to the cluster, and will allow you to automatically install the official DGX base OS on DGX servers. If you elect to use this management service, you will need to have a dedicated network segment and subnet which can be controlled by the DHCP server.

Installation Steps

Installation Overview

  1. Download and configure DeepOps repo
  2. Deploy management server(s)
    • Bootstrap
    • Deploy Kubernetes
    • Deploy Ceph persistent storage on management nodes
  3. Deploy cluster service containers on Kubernetes
    • DHCP/DNS/PXE, container registry, Apt repo, monitoring, alerting
  4. Deploy DGX-1 compute nodes
    • Install DGX OS (via PXE), bootstrap (via Ansible)
    • Update firmware (via Ansible, if required)
    • Join DGX-1 compute nodes to Kubernetes cluster and deploy GPU device plugin
  5. Deploy login node
    • Install OS (via PXE), bootstrap (via Ansible)
    • Install/build HPC software and modules
  6. Deploy cluster SW layers
    • Install Slurm HPC scheduler on login and compute nodes
    • Configure Kubernetes OAuth integration for user access

1. Download and configure

Download the DeepOps repo onto the provisioning system and copy the example configuration files so that you can make local changes:

git clone --recursive https://github.com/NVIDIA/deepops.git
cd deepops
cp -r config.example/ config/
ansible-galaxy install -r requirements.yml

Note: In Git 2.16.2 or later, use --recurse-submodules instead of --recursive. If you did a non-recursive clone, you can later run git submodule update --init --recursive to pull down submodules

The config/ directory is ignored by git, so a new git repository can be created in this directory to track local changes:

cd config/
git init .
git add .
git commit -am 'initial commit'

Use the config/inventory file to set the cluster server hostnames, and optional per-host info like IP addresses and network interfaces. The cluster should ideally use DNS, but you can also explicitly set server IP addresses in the inventory file.

Optional inventory settings:

  • Use the ansible_host variable to set alternate IP addresses for servers or for servers which do not have resolvable hostnames
  • Use the ib_bond_addr variable to configure the infiniband network adapters with IPoIB in a single bonded interface

Configure cluster parameters by modifying the various yaml files in the config/group_vars directory. The cluster-wide global config resides in the all.yml file, while group-specific options reside in the other files. File names correspond to groups in the inventory file, i.e. [dgx-servers] in the inventory file corresponds with config/group_vars/dgx-servers.yml.
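As an illustration of how inventory groups map to group_vars files, the sketch below writes a small inventory to a scratch file and lists its group names. The hostnames, the IP addresses, and the assumption that ib_bond_addr takes an IP address are hypothetical; the group names mgmt, dgx-servers, and login are the ones used elsewhere in this guide.

```shell
# Write a minimal example inventory to a scratch file.
# All hosts and addresses below are placeholders, not defaults.
inventory=$(mktemp)
cat > "$inventory" <<'EOF'
[mgmt]
mgmt01 ansible_host=10.0.0.1

[dgx-servers]
dgx01 ansible_host=10.0.0.11 ib_bond_addr=172.16.0.11

[login]
login01
EOF

# List the group names. Each group may have a matching
# config/group_vars/<group>.yml file holding its options.
sed -n 's/^\[\(.*\)]$/\1/p' "$inventory"
```

A host such as login01 that has no ansible_host entry relies on DNS to resolve its name.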

2. Management server setup

The configuration assumes a single CPU-only management server, but multiple management servers can be used for high-availability.

Install the latest version of Ubuntu Server 16.04 LTS on each management server. Be sure to enable SSH and record the user and password used during install.

Bootstrap:

The password and SSH keys added to the ubuntu user in the config/group_vars/all.yml file will be configured on the management node. You should add an SSH key to the configuration file, or you will have to append the -k flag and type the password for the ubuntu user for all Ansible commands following the bootstrap.

Deploy management node(s):

When prompted, type the password for the user you configured during the management server OS installation. The -k flag asks for the SSH password and the -K flag asks for the sudo password used while configuring the management servers. If the management servers were installed with SSH keys and sudo does not require a password, you may omit the -k and -K flags.

ansible-playbook -l mgmt -k -K ansible/playbooks/bootstrap.yml

Where mgmt is the group of servers in your config/inventory file which will become management servers for the cluster.

To run arbitrary commands in parallel across nodes in the cluster, you can use ansible and the groups or hosts defined in the inventory file, for example:

ansible mgmt -a hostname

For more info, see: https://docs.ansible.com/ansible/latest/user_guide/intro_adhoc.html

Apply additional changes to management servers to disable swap (required for Kubernetes):

ansible mgmt -b -a "swapoff -a"
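Note that swapoff -a only lasts until the next reboot. One possible way to keep swap disabled, assuming swap is configured through /etc/fstab, is to comment out the swap entries there. The helper below is a hypothetical sketch and not part of DeepOps, and Kubespray may already take care of this for you. It prints a modified copy of an fstab-style file so that the result can be reviewed before it replaces the original.

```shell
# Print an fstab-style file with every active swap entry commented out.
# Lines that are already comments are left untouched.
# Usage: disable_swap_entries <path-to-fstab>
disable_swap_entries() {
  sed -E '/^[[:space:]]*#/! s/^(.*[[:space:]]swap[[:space:]].*)$/#\1/' "$1"
}

# Example (run on a management server, then review the output):
# disable_swap_entries /etc/fstab
```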

If you need to configure a secondary network interface for the private DGX network, modify /etc/network/interfaces. For example:

auto ens192
iface ens192 inet static
    address 192.168.1.1/24
    dns-nameservers 8.8.8.8 8.8.4.4
    gateway 192.168.1.1
    mtu 1500

Kubernetes:

Deploy Kubernetes on management servers:

Modify the file config/kube.yml if needed and deploy Kubernetes:

ansible-playbook -l mgmt -v -b --flush-cache --extra-vars "@config/kube.yml" kubespray/cluster.yml

Set up Kubernetes for remote administration:

ansible mgmt -b -m fetch -a "src=/etc/kubernetes/admin.conf flat=yes dest=./"
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x ./kubectl

To make administration easier, you may want to copy the kubectl binary to someplace in your $PATH and copy the admin.conf configuration file to ~/.kube/config so that it is used by default. Otherwise, you may use the kubectl flag --kubeconfig=./admin.conf instead of copying the configuration file.

If you have an existing Kubernetes configuration file, you can merge the two with:

mv ~/.kube/config{,.bak} && KUBECONFIG=./admin.conf:~/.kube/config.bak kubectl config view --flatten | tee ~/.kube/config

Test that you can access the Kubernetes cluster:

$ kubectl get nodes
NAME      STATUS    ROLES         AGE       VERSION
mgmt01    Ready     master,node   7m        v1.11.0

Helm:

Some services are installed using Helm, a package manager for Kubernetes.

Install the Helm client by following the instructions for the OS on your provisioning system: https://docs.helm.sh/using_helm/#installing-helm

If you're using Linux, the script scripts/helm_install_linux.sh will set up Helm for the current user

Be sure to install a version of Helm matching the version in config/kube.yml

(Optional) If helm_enabled is true in config/kube.yml, the Helm server will already be deployed in Kubernetes. If it needs to be installed manually for some reason, run:

kubectl create sa tiller --namespace kube-system
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --node-selectors node-role.kubernetes.io/master=true

Ceph:

Persistent storage for Kubernetes on the management nodes is supplied by Ceph. Ceph is provisioned using Rook to simplify deployment:

helm repo add rook-stable https://charts.rook.io/stable
helm install --namespace rook-ceph-system --name rook-ceph rook-stable/rook-ceph --version v0.9.1 --set agent.flexVolumeDirPath=/var/lib/kubelet/volume-plugins/
kubectl create -f services/rook-cluster.yml

Note: It will take a few minutes for containers to be pulled and started. Wait for Rook to be fully installed before proceeding

You can check Ceph status with:

kubectl -n rook-ceph exec -ti rook-ceph-tools ceph status

3. Services

Ingress controller:

An ingress controller routes external traffic to services.

Modify config/ingress.yml if needed and install the ingress controller:

helm install --values config/ingress.yml stable/nginx-ingress

You can check the ingress controller logs with:

kubectl logs -l app=nginx-ingress

DHCP/DNS/PXE server (DGXie):

DGXie is an all-in-one container for DHCP, DNS, and PXE, specifically tailored to the DGX Base OS. If you already have DHCP, DNS, or PXE servers you can skip this step.

Setup

You will need to download the official DGX Base OS ISO image to your provisioning machine. The latest DGX Base OS is available via the NVIDIA Enterprise Support Portal (ESP).

Copy the DGX Base OS ISO to shared storage via a container running in Kubernetes, substituting the path to the DGX ISO you downloaded (be sure to wait for the iso-loader POD to be in the Running state before attempting to copy the ISO):

kubectl apply -f services/iso-loader.yml
kubectl cp /path/to/DGXServer-3.1.2.170902_f8777e.iso $(kubectl get pod -l app=iso-loader -o custom-columns=:metadata.name --no-headers):/data/iso/

Note: If the iso-loader POD fails to mount the CephFS volume, you may need to restart the kubelet service on the master node(s): ansible mgmt -b -a "systemctl restart kubelet"
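Waiting for the iso-loader POD to reach the Running state can be scripted. The helper below is a minimal sketch and not part of DeepOps: wait_for_output is a hypothetical name, and the attempt count and delay in the example are arbitrary. It polls any command until its output matches an expected string.

```shell
# Poll a command until its output equals the expected string.
# Usage: wait_for_output <expected> <attempts> <delay-seconds> <command...>
# Returns 0 on a match, 1 if all attempts are used up.
wait_for_output() {
  expected=$1; attempts=$2; delay=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    # A failing command (e.g. the pod does not exist yet) counts as "not ready".
    out=$("$@" 2>/dev/null) || out=""
    [ "$out" = "$expected" ] && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (requires a working kubectl context):
# wait_for_output Running 60 5 \
#   kubectl get pod -l app=iso-loader -o jsonpath='{.items[0].status.phase}'
```

Once the helper returns successfully, the kubectl cp command above can be run.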

Configure

Modify the DGXie configuration in config/dgxie.yml to set values for the DHCP server and DGX install process

Modify config/dhcpd.hosts.conf to add a static IP lease for each login node and DGX server in the cluster if required. IP addresses should match those used in the config/inventory file. You may also add other valid configuration options for dnsmasq to this file.

You can get the MAC address of DGX system interfaces via the BMC, for example:

# interface 1
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x30 0x19 0x00 0x02 | tail -c 18 | tr ' ' ':'
# interface 2
ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x30 0x19 0x00 0x12 | tail -c 18 | tr ' ' ':'
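Once the MAC addresses are known, static lease entries can be generated rather than typed by hand. The sketch below assumes that config/dhcpd.hosts.conf accepts the dnsmasq dhcp-host=<mac>,<ip>,<hostname> form; check the example file shipped with the repo before relying on it. The function name and the sample MAC, IP address, and hostname are hypothetical.

```shell
# Print a dnsmasq static-lease line: dhcp-host=<mac>,<ip>,<hostname>
# The MAC is lower-cased and checked for the xx:xx:xx:xx:xx:xx form.
dhcp_host_line() {
  mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  if ! printf '%s\n' "$mac" | grep -Eq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'; then
    echo "invalid MAC address: $1" >&2
    return 1
  fi
  printf 'dhcp-host=%s,%s,%s\n' "$mac" "$2" "$3"
}

# Example with placeholder values:
# dhcp_host_line AA:BB:CC:DD:EE:01 10.0.0.11 dgx01 >> config/dhcpd.hosts.conf
```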

Modify config/machines.json to add a PXE entry for each DGX. Copy the dgx-example section and modify the MAC address for each DGX you would like to boot. You can modify boot parameters or install alternate operating systems if required.

Store the config files as config-maps in Kubernetes, even if you have not made any changes (the DGXie container will try to mount these config maps):

kubectl create configmap dhcpd --from-file=config/dhcpd.hosts.conf
kubectl create configmap pxe-machines --from-file=config/machines.json

Deploy DGXie service

Launch the DGXie service:

helm install --values config/dgxie.yml services/dgxie

Check the DGXie logs to make sure the services were started without errors:

kubectl logs -l app=dgxie

Configure the management server(s) to use DGXie for cluster-wide DNS:

ansible-playbook -l mgmt ansible/playbooks/resolv.yml

If you later make changes to config/dhcpd.hosts.conf, you can update the file in Kubernetes and restart the service with:

kubectl create configmap dhcpd --from-file=config/dhcpd.hosts.conf -o yaml --dry-run | kubectl replace -f -
kubectl delete pod -l app=dgxie

If you make changes to machines.json, you can update the file without having to restart the DGXie POD:

kubectl create configmap pxe-machines --from-file=config/machines.json -o yaml --dry-run | kubectl replace -f -

APT Repo:

Launch the APT repository service, which runs on port 30000 (http://mgmt:30000):

kubectl apply -f services/apt.yml

Container Registry:

Modify config/registry.yml if needed and launch the container registry:

helm repo add stable https://kubernetes-charts.storage.googleapis.com
helm install --values config/registry.yml stable/docker-registry --version 1.6.1

Once you have provisioned DGX servers, configure them to allow access to the local (insecure) container registry:

ansible-playbook -k ansible/playbooks/docker.yml

You can check the container registry logs with:

kubectl logs -l app=docker-registry

The container registry will be available to nodes in the cluster at registry.local, for example:

# pull container image from docker hub
docker pull busybox:latest

# tag image for local container registry
# (you can also get the image ID manually with: docker images)
docker tag $(docker images -f reference=busybox --format "{{.ID}}") registry.local/busybox

# push image to local container registry
docker push registry.local/busybox

Monitoring:

Cluster monitoring is provided by Prometheus and Grafana

Service addresses:

Where mgmt represents a DNS name or IP address of one of the management hosts in the Kubernetes cluster. The default Grafana username and password are both admin.

Modify config/prometheus-operator.yml and config/kube-prometheus.yml if desired and deploy the monitoring and alerting stack:

helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring --values config/prometheus-operator.yml
kubectl create configmap kube-prometheus-grafana-gpu --from-file=config/gpu-dashboard.json -n monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values config/kube-prometheus.yml

To collect GPU metrics, label each GPU node and deploy the DCGM Prometheus exporter:

kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
kubectl create -f services/dcgm-exporter.yml

Logging:

Centralized logging is provided by Filebeat, Elasticsearch and Kibana

Note: The ELK Helm chart is currently out of date and does not provide support for setting the Kibana NodePort

todo:

  • filebeat syslog module needs to be in UTC somehow, syslog in UTC?
  • fix kibana nodeport issue

Make sure all systems are set to the same timezone:

ansible all -k -b -a 'timedatectl status'

To update the timezone, use:

ansible all -k -b -a 'timedatectl set-timezone <timezone>'

Where <timezone> is a name listed by timedatectl list-timezones.

Install Osquery:

ansible-playbook -k ansible/playbooks/osquery.yml

Deploy Elasticsearch and Kibana:

helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --name elk --namespace logging --values config/elk.yml incubator/elastic-stack

The ELK stack will take several minutes to install. Wait for Elasticsearch to be reported as ready in Kibana before proceeding.

Launch Filebeat, which will create an Elasticsearch index automatically:

helm install --name log --namespace logging --values config/filebeat.yml stable/filebeat

The logging stack can be deleted with:

helm del --purge log
helm del --purge elk
kubectl delete statefulset/elk-elasticsearch-data
kubectl delete pvc -l app=elasticsearch
# wait for all statefulsets to be removed before re-installing...

4. DGX compute nodes:

Provisioning:

Provision DGX nodes with the official DGX ISO over PXE boot using DGXie.

Note: The scripts/do_ipmi.sh script has these commands and can be looped over multiple hosts

Disable the 60-second timeout on DGX IPMI boot device selection. This is required, but only needs to be done once for each DGX:

ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x00 0x08 0x03 0x08

Note: The default IPMI username and password are both qct.admin

Set the DGX to boot from the first disk, using EFI, and to persist the setting:

ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> raw 0x00 0x08 0x05 0xe0 0x08 0x00 0x00 0x00

Set the DGX to boot from the network in EFI mode, for the next boot only. If you set a DGX to always boot from the network, it will get stuck in an install loop. The installer should set the system to boot from the first disk via EFI after the install is finished.

ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> chassis bootdev pxe options=efiboot

Note: If you have manually modified the boot order in the DGX SBIOS, you may need to manually return it to boot from disk by default before running the IPMI commands above to alter the boot order

Power cycle/on the DGX to begin the install process

ipmitool -I lanplus -U <username> -P <password> -H <DGX BMC IP> power cycle
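The four IPMI steps above can be applied to several DGX systems in one pass. The sketch below is not a copy of scripts/do_ipmi.sh; the function names, the DRY_RUN switch, and the BMC addresses in the example are hypothetical. With DRY_RUN=1 (the default here) the commands are only printed, with the password masked, so the sequence can be checked before anything is power cycled.

```shell
# Credentials and mode come from the environment; placeholders by default.
DRY_RUN=${DRY_RUN:-1}
IPMI_USER=${IPMI_USER:-"<username>"}
IPMI_PASS=${IPMI_PASS:-"<password>"}

# Run (or, in dry-run mode, print) one ipmitool command against one BMC.
run_ipmi() {
  host=$1; shift
  if [ "$DRY_RUN" = "1" ]; then
    echo "ipmitool -I lanplus -U $IPMI_USER -P *** -H $host $*"
  else
    ipmitool -I lanplus -U "$IPMI_USER" -P "$IPMI_PASS" -H "$host" "$@"
  fi
}

# Apply the PXE install sequence from this section to each BMC given.
pxe_install() {
  for bmc in "$@"; do
    run_ipmi "$bmc" raw 0x00 0x08 0x03 0x08                      # disable 60s timeout
    run_ipmi "$bmc" raw 0x00 0x08 0x05 0xe0 0x08 0x00 0x00 0x00  # persistent EFI disk boot
    run_ipmi "$bmc" chassis bootdev pxe options=efiboot          # next boot: network
    run_ipmi "$bmc" power cycle
  done
}

# Example with placeholder BMC addresses:
# pxe_install 10.0.0.101 10.0.0.102
```

Set DRY_RUN=0 together with IPMI_USER and IPMI_PASS to run the commands for real.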

The DGX install process will take approximately 15 minutes. You can check the DGXie logs with:

kubectl logs -l app=dgxie

If your DGX servers are on an un-routable subnet, uncomment the ansible_ssh_common_args variable in the config/group_vars/dgx-servers.yml file and change the IP address to that of the management server with access to the private subnet, for example:

ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q ubuntu@10.0.0.1"'

Test the connection to the DGX servers via the bastion host (management server). Type the password for dgxuser on the DGX when prompted. The default password for dgxuser is DgxUser123:

ansible dgx-servers -k -a 'hostname'

Configuration:

Configuration of the DGX is accomplished via Ansible roles. Various playbooks to install components are available in ansible/playbooks. Modify the file ansible/site.yml to enable or disable various playbooks, or run playbooks directly.

Type the default password for dgxuser on the DGX when prompted while running the bootstrap playbook. The default password for dgxuser is DgxUser123:

ansible-playbook -k -K -l dgx-servers ansible/playbooks/bootstrap.yml

After running the first command, you may omit the -K flag on subsequent runs. The password for the deepops user will also change to the one set in the config/group_vars/all.yml file (by default, this password is deepops). Run the site playbook to finish configuring the DGX:

ansible-playbook -k -l dgx-servers ansible/site.yml

Updating Firmware:

Firmware on the DGX can be updated through the firmware update container(s) and Ansible.

  1. Download the firmware update container package from the NVIDIA Enterprise Support Portal. Updates are published as announcements on the support portal (example: https://goo.gl/3zimCk). Make sure you download the correct package for the GPU in the DGX-1.
  2. Once you've downloaded the .tar.gz file, copy or move it inside containers/dgx-firmware
  3. Edit the value of firmware_update_container in the file ansible/roles/nvidia-dgx-firmware/vars/main.yml to match the name of the downloaded firmware container.
  4. Run the Ansible playbook to update DGX firmware:
ansible-playbook -k -l dgx-servers ansible/playbooks/firmware.yml

Adding DGX to Kubernetes:

Create the NVIDIA GPU k8s device plugin daemon set (this only needs to be done once):

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

If the DGX is a member of the Slurm cluster, be sure to drain the node in Slurm so that it does not accept Slurm jobs. From the login node, run:

sudo scontrol update node=dgx01 state=drain reason=k8s

Modify the config/inventory file to add the DGX to the kube-node and k8s-gpu categories by uncommenting the dgx-servers entry in these sections

Re-run Kubespray to install Kubernetes on the DGX:

ansible-playbook -l k8s-cluster -k -v -b --flush-cache --extra-vars "@config/kube.yml" kubespray/cluster.yml

Note: If the Kubespray run fails for any reason, try running it again

Check that the installation was successful:

$ kubectl get nodes
NAME      STATUS    ROLES         AGE       VERSION
dgx01     Ready     node          3m        v1.11.0
mgmt01    Ready     master,node   2d        v1.11.0

Place a hold on the docker-ce package so it doesn't get upgraded:

ansible dgx-servers -k -b -a "apt-mark hold docker-ce"

Install the nvidia-container-runtime on the DGX:

ansible-playbook -l k8s-gpu -k -v -b --flush-cache --extra-vars "@config/kube.yml" playbooks/k8s-gpu.yml

Test that GPU support is working:

kubectl apply -f tests/gpu-test-job.yml
kubectl exec -ti gpu-pod -- nvidia-smi -L
kubectl delete pod gpu-pod

5. Login server:

Note: If you do not require a login node, you may skip this section

Note: By default the login node(s) are not part of the kubernetes cluster. If you need to add login node(s) to the kubernetes cluster, add login servers to the kubernetes categories in the config/inventory file and re-run the ansible playbooks as above for management and DGX servers.

Provisioning:

Modify config/dhcpd.hosts.conf to add a static IP lease for each login node if required. IP addresses should match those used in the config/inventory file.

Update the dhcpd.hosts.conf config map if modified and restart the DGXie POD:

kubectl create configmap dhcpd --from-file=config/dhcpd.hosts.conf -o yaml --dry-run | kubectl replace -f -
kubectl delete pod -l app=dgxie

Modify config/machines.json to add a PXE entry for each login node. Copy the 64-bit-ubuntu-example section and modify the MAC address for each login node you would like to boot. You can modify boot parameters or install alternate operating systems if required.

Update the PXE server config map:

kubectl create configmap pxe-machines --from-file=config/machines.json -o yaml --dry-run | kubectl replace -f -

Set login nodes to boot from the network for the next boot only and power on the systems. The login nodes should receive a response from the DGXie service and begin the OS install process.

Note: Be sure to either monitor the PXE install or configure servers to boot from the network on the next boot only to avoid a re-install loop

If manually configuring the install, be sure the initial user matches the user in config/group_vars/login.yml.

Configuration:

Once OS installation is complete, bootstrap and configure the login node(s) via Ansible.

If your login nodes are on an un-routable subnet, uncomment the ansible_ssh_common_args variable in the config/group_vars/login.yml file and change the IP address to that of the management server with access to the private subnet, for example:

ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q ubuntu@10.0.0.1"'

Various playbooks to install components are available in ansible/playbooks. Modify the file ansible/site.yml to enable or disable various playbooks, or run playbooks directly:

ansible-playbook -k -K -l login ansible/playbooks/bootstrap.yml
ansible-playbook -k -l login ansible/site.yml

6. Additional Components

Slurm:

Slurm overview: https://slurm.schedmd.com/overview.html

"Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters."

Note: For more information on Slurm and GPUs, see: https://github.com/dholt/slurm-gpu

To install Slurm, configure nodes in config/inventory and run the Ansible playbook:

ansible-playbook -k -l slurm-cluster ansible/playbooks/slurm.yml

DGX nodes may appear 'down' in Slurm after install due to rebooting. Set nodes to idle if required:

sudo scontrol update node=dgx01 state=idle

Cluster Usage

Maintenance

Login server

Adding Software:

To modify installed software on cluster nodes, edit the package list in ansible/roles/software/defaults/main.yml and apply the changes:

ansible-playbook -k -l login ansible/playbooks/software.yml

The playbooks/extra.yml file contains optional configuration (these will be moved at a later date):

ansible-playbook -k -l all playbooks/extra.yml

Building software:

HPC clusters generally utilize a system of versioned software modules instead of installing software via the OS package manager. These software builds can be made easier with the EasyBuild tool. The software build environment should be set up on the login node in a shared directory accessible by all cluster nodes.

Assuming you created or used an existing NFS share during cluster bootstrap, create a directory to hold software builds and create a direnv file to facilitate easier EasyBuild builds:

EasyBuild environment file:

$ cat /shared/.envrc
export EASYBUILD_PREFIX=/shared/sw
export EASYBUILD_MODULES_TOOL=Lmod
export EASYBUILD_JOB_BACKEND=GC3Pie
export EASYBUILD_JOB_BACKEND_CONFIG=/shared/.gc3pie.cfg
module use /shared/sw/modules/all
module load EasyBuild

Where the shared NFS directory is /shared, and initial software/modules built with EasyBuild are installed in /shared/sw.

The direnv package should have been installed by default during cluster node configuration. For more information on direnv, see: https://direnv.net/.

Use direnv to automatically set your EasyBuild environment; first add an appropriate command to your shell login scripts:

type direnv >/dev/null 2>&1 && eval "$(direnv hook bash)"

Then cd /shared and run direnv allow. The .envrc file should set up the environment to use EasyBuild

Install EasyBuild using the shared directory as the install path:

# pick an installation prefix to install EasyBuild to (change this to your liking)
EASYBUILD_PREFIX=/shared/sw

# download script
curl -O https://raw.githubusercontent.com/easybuilders/easybuild-framework/develop/easybuild/scripts/bootstrap_eb.py

# bootstrap EasyBuild
python bootstrap_eb.py $EASYBUILD_PREFIX

# update $MODULEPATH, and load the EasyBuild module
module use $EASYBUILD_PREFIX/modules/all
module load EasyBuild

Example usage for building software:

# search
eb -S gcc-6
# build
eb GCC-6.4.0-2.28.eb -r

Example usage for using software:

# prepend environment module path
export MODULEPATH=$EASYBUILD_PREFIX/modules/all:$MODULEPATH

# load environment module
module load HPL

Cluster-wide

Slurm updates:

# whole shebang:
ansible-playbook -k -l slurm-cluster ansible/playbooks/slurm.yml
# just prolog and/or epilog:
ansible-playbook -k -l compute-nodes --tags prolog,epilog -e 'gather_facts=no' ansible/playbooks/slurm.yml

Modify GPU drivers:

ansible-playbook -k -l <dgx-hostname> playbooks/gpu-driver.yml

Extra:

Set up /raid RAID-0 array cache (can also add rebuild-raid to PXE boot cmdline when installing):

ansible dgx-servers -k -b -a "/usr/bin/configure_raid_array.py -i"

Un-freeze NVLINK counters (may want to use 0brw for just read/write):

ansible dgx-servers -k -b -a "nvidia-smi nvlink -sc 0bz"

Kubernetes

Managing DGX scheduler allocation:

Once the DGX compute nodes have been added to Kubernetes and Slurm, you can use the scripts/doctl.sh script to manage which scheduler each DGX is allowed to run jobs from.

NVIDIA GPU Cloud Container Registry (NGC):

Create secret for registry login:

kubectl create secret docker-registry ngc --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<api-key> --docker-email='foo@example.com'

Add to Kubernetes pod spec:

  imagePullSecrets:
    - name: ngc
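For context, the sketch below writes a complete pod manifest that uses the ngc secret. It is illustrative only: the pod name, container name, and output path are hypothetical, and the image is left as a placeholder to be replaced with a real NGC image reference. The nvidia.com/gpu limit assumes the NVIDIA device plugin described earlier is deployed.

```shell
# Write an example pod manifest that pulls from nvcr.io with the ngc secret.
manifest=${TMPDIR:-/tmp}/ngc-pod.yml
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ngc-example
spec:
  restartPolicy: Never
  imagePullSecrets:
    - name: ngc
  containers:
    - name: main
      # Placeholder: substitute a real NGC image reference.
      image: nvcr.io/<org>/<image>:<tag>
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
echo "wrote $manifest"

# After replacing the image placeholder:
# kubectl apply -f "$manifest"
```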

Upgrading Helm Charts:

If you make changes to configuration or want to update Helm charts, you can use the helm upgrade command to apply changes

Show currently installed releases:

helm list

To upgrade the ingress controller with new values from config/ingress.yml for example, you would run:

helm upgrade --values config/ingress.yml <release_name> stable/nginx-ingress

Where <release_name> is the name of the deployed ingress controller chart obtained from helm list.

Kubernetes user access:

TODO:

  • (done) restrict namespace to nodes with specific labels, i.e. scheduler=k8s
  • wait for k8s fix to daemonset and PodNodeSelector issues

Using OAuth2

References: https://medium.com/@jessgreb01/kubernetes-authn-authz-with-google-oidc-and-rbac-74509ca8267e

Copy admin.conf and ca.pem from a kube master (i.e. mgmt01) to /root/.kube on the login node (i.e. login01).

Generate an OAUTH2 client JSON config file and copy the user script to the login node:

sudo mkdir -p /shared/{bin,etc}
sudo cp scripts/k8s_user.sh /shared/bin/
sudo chmod +x /shared/bin/k8s_user.sh
sudo cp config/google_oauth2_client.json /shared/etc/

Download kubectl and ks (ksonnet) and put them in /shared/bin.

Users can run the script to log in to Google Auth, generate tokens and create a kube config: sudo /shared/bin/k8s_user.sh

Restrict Namespaces:

todo: a daemonset will still continuously try and fail to schedule pods on all nodes

User namespaces need to be restricted to nodes which are in k8s scheduling mode. Otherwise users can run pods on management nodes and nodes which are being managed by Slurm (via a DaemonSet for example).

Update the Kubespray config in config/kube.yml to tell the Kube API server to use the PodNodeSelector admission controller (this should already be the default):

kube_apiserver_admission_control:
  ...
  - PodNodeSelector

Patch namespaces to apply a specific node selector to every pod:

kubectl patch namespace <username> -p '{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/node-selector":"scheduler=k8s"}}}'
kubectl get ns <username> -o yaml

Where <username> is the name of the namespace, typically the same as the username
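When several user namespaces need the same restriction, the patch can be generated instead of typed for each one. The helpers below are a hypothetical sketch: they only build the JSON patch shown above and print the matching kubectl commands, without running them. The namespace names in the example are placeholders.

```shell
# Build the JSON patch that applies a node selector (e.g. scheduler=k8s)
# to every pod created in a namespace.
node_selector_patch() {
  printf '{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/node-selector":"%s"}}}\n' "$1"
}

# Print (not run) one kubectl patch command per namespace.
# Usage: print_patch_commands <selector> <namespace>...
print_patch_commands() {
  selector=$1; shift
  for ns in "$@"; do
    printf "kubectl patch namespace %s -p '%s'\n" "$ns" "$(node_selector_patch "$selector")"
  done
}

# Example with placeholder namespaces:
# print_patch_commands scheduler=k8s alice bob
```

The printed commands can be reviewed and then piped to sh to apply them.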

Using certs

Source: https://docs.bitnami.com/kubernetes/how-to/configure-rbac-in-your-kubernetes-cluster/

Copy the script to one of the management nodes and run it to create a user:

scp scripts/add_user.sh mgmt-01:/tmp
ssh mgmt-01 /tmp/add_user.sh <username>
scp mgmt-01:~/<username>.kubeconfig ~/.kube/config

Where <username> is the name of the new user account being created

Kubernetes add-ons:

Service Mesh:

This may be needed for L7 load-balancing for gRPC services

kubectl apply -f services/ambassador-service.yml
kubectl apply -f services/ambassador-rbac.yml

Troubleshooting

If Ansible complains that a variable is undefined, you can check node values with something like:

ansible all -m debug -a "var=ansible_default_ipv4"

Where ansible_default_ipv4 is the variable in question

Rook:

If you need to remove Rook for any reason, here are the steps:

kubectl delete -f services/rook-cluster.yml
helm del --purge rook-ceph
kubectl delete namespace rook-ceph-system
ansible all -b -m file -a "path=/var/lib/rook state=absent"

If the first task doesn't work, you might want to pause that process and first remove the finalizers from the cluster:

kubectl -n rook-ceph patch cephclusters.ceph.rook.io rook-ceph -p '{"metadata":{"finalizers": []}}' --type=merge

In some cases, you also have to manually erase the rook-ceph-block StorageClass:

kubectl delete storageclass rook-ceph-block

Open Source Software

Software used in this project:

Copyright and License

This project is released under the BSD 3-clause license.

Issues and Contributing

A signed copy of the Contributor License Agreement needs to be provided to deepops@nvidia.com before any change can be accepted.
