Santos Dumont Super Computer

Accessing and using the SDumont infrastructure for Deep Learning research. — August 30, 2021

This post details how to get access to and use the Santos Dumont supercomputer, managed by LNCC. It is largely based on the official support manual provided by LNCC, as well as on my personal experience in the program (hence, it may not perfectly represent all cases). Its goal is to present the information in a more directed manner for users who share my profile: Deep Learning researchers who prefer the TensorFlow framework and are familiar with Docker.

SDumont

Santos Dumont is a Brazilian supercomputer located in the city of Petrópolis, Rio de Janeiro State, Brazil. A general description of its specs and processing power can be found at sdumont.lncc.br/machine.

External view of Santos Dumont Supercomputer Installations. Available at gov.br/mcti.

Brazilian scientists and research groups are encouraged to submit a research project proposal, which, if accepted, grants them usage of the available infrastructure. Furthermore, its processing power can also be used for educational purposes. To this end, educators must first present a course plan and specify usage details.

Project Proposal

The engagement process starts at this website, which states that proposals will be evaluated until November 27th, 2021. To be considered, you must fill in this application and submit it through the JEMS system.

First Access and Setup

Once a proposal is accepted, a welcoming e-mail is sent from jems@sbc.org.br to all authors. The e-mail contains details and comments from the reviewers regarding the project proposal, as well as the grading score for each evaluation category.

User and Project Registration

The welcoming e-mail will ask the project members to fill in two forms:

  • Project Coordinator Form: should be filled in by the project coordinator; it formalizes the project scope (by repeating what was written in the proposal or by updating it with justifiable notice) and lists all members involved in the project. link
  • Project User Form: should be filled in by each member listed in the Project Coordinator Form (one form per member). link

Identification information (such as the “Registro Geral”, or R.G.) must be provided in both forms, which must be printed and signed (or digitally signed). The forms, as well as scans of the ID documents of each project member and of the coordinator, must be e-mailed to helpdesk-sdumont@lncc.br and sdumont@lncc.br. There is a 30-day timeframe to complete this stage of the registration.

Once it is done, you will receive the following confirmation e-mail:

FROM: helpdesk-sdumont@lncc.br
SUBJECT: Re: Formulários de cadastro SDumont - ID {ID_NUMBER}

Dear {NAME},

We have registered the tickets below for the creation of the project and user accounts:

Ticket - Opening of SDumont project account - {PROJECT_CODE} - ID {PROJECT_ID}
Ticket - Opening of SDumont user account - {NAME} - ID {USER_ID}
Ticket - Opening of SDumont user account - {NAME} - ID {USER_ID}
...

Wait for a day. Each member listed by the Coordinator will receive their own registration confirmation e-mail, with their name and ID detailed in the subject. Your temporary password, however, is not included in the e-mail (or anywhere else). You must call LNCC on the phone number provided and tell them you have just received the confirmation e-mail; inform your name and ID number, and they will tell you your password verbally.

Configure and open the VPN tunnel (using the instructions in the e-mail and the credentials provided over the phone). I use Linux and opted for the graphical VPN interface, so the following packages are necessary:

sudo apt-get install vpnc network-manager-vpnc network-manager-vpnc-gnome

Once the tunnel is open, you can SSH into the butler host:

ssh USERNAME@login.sdumont.lncc.br
password: ******

Use your temporary password. Once you are logged in, you will be asked to change it. After this is done, you will be disconnected from the VPN. Update its settings to reflect the new password, if necessary, and open the tunnel once again. SSH into the butler once more and you will now be connected to SDumont!

Resources

Santos Dumont uses the Slurm Workload Manager to orchestrate its nodes. Two important aspects to understand about this manager are the distributed storage system employed and the job scheduling mechanism.

Storage

Two storage partitions are available [ref]:

Name     Capacity  Location                   Description
Scratch  25 TB     /scratch/{PROJECT}/{USER}  input, transient and output data
Home     5 TB      /prj/{PROJECT}/{USER}      source code, libraries

* Scratch is 50 TB in Premium tier projects.

Files in Scratch are erased after 60 days without modification, and should be backed up to Home. The processing nodes cannot access files in Home, so all source code, datasets and associated archives must be copied to the Scratch partition before usage. For example, say you have the file job.py inside the experiments folder. The entire experiments folder can be synchronized with its counterpart in the Scratch partition using the following command:

rsync -rUv --delete experiments/ $SCRATCH/experiments/
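Here, $SCRATCH is assumed to point to /scratch/{PROJECT}/{USER}, the Scratch path from the table above. If the variable is not already defined in your environment, it can be set manually:

export SCRATCH=/scratch/{PROJECT}/{USER}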

Once the experiments have run and their logs have been produced, we can copy them back to the Home partition:

rsync -rUv $SCRATCH/logs logs
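The rsync calls above are run on SDumont itself, moving data between the Home and Scratch partitions. To upload a local dataset in the first place, a similar rsync can be issued from your own machine, going through the login node over the open VPN tunnel (a sketch; the datasets folder and destination path are illustrative):

# Run from the local machine, with the VPN tunnel open:
rsync -rUvz datasets/ USERNAME@login.sdumont.lncc.br:/scratch/{PROJECT}/{USER}/datasets/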

GPUs

Below, I list the most interesting GPU queues for Machine Learning, along with their specs.

Name              Wall time (hours)  CPUs  Cores  Memory (GB)  GPUs     Cost (UA)
nvidia_dev        0.33               2     24     64           2 K40    1.5
nvidia_small      1                  2     24     64           2 K40    1.5
nvidia_long       744 (31 days)      2     24     64           2 K40    1.5
nvidia_scal       18                 102   1224   6528         102 K40  1.5
nvidia            48                 42    504    2688         42 K40   1.5
sequana_gpu       96                 2     48     384          4 V100   1.5
sequana_gpu_dev   0.33               2     48     384          4 V100   1.5
sequana_gpu_long  744 (31 days)      2     48     384          4 V100   1.5
gdl               48                 2     40     384          8 V100   2.0
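Queue limits and current availability can also be inspected directly from the login node with Slurm's standard sinfo command (the partition names below are the ones from the table above; sinfo with no arguments lists all of them):

sinfo -p nvidia_dev,nvidia_small,nvidia_long,sequana_gpu,gdl
sinfo -p gdl --format="%P %a %l %D %t"   # partition, availability, time limit, node count, state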

Submitting Jobs

Jobs can be submitted in two ways: interactively or queued for execution. During submission, you must specify the queue in which the job will be placed. For the examples below, assume you have a Python script named experiment.py that runs and evaluates your experiment.

NVIDIA Queues

These queues should be used for regular experiments (each node has two K40 GPUs with 12 GB of memory each).

Interactive Access

These nodes can be accessed interactively with the following commands:

$ salloc --nodes=1 -p {queue} -J {name} --exclusive

salloc: Pending job allocation {job_id}
salloc: job {job_id} queued and waiting for resources
salloc: job {job_id} has been allocated resources
salloc: Granted job allocation {job_id}

$ squeue -j {job_id}
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          {job_id}   {queue}   {name}   {user}  R       0:21      1 {node_id}

$ ssh {node_id}

Now go to tensorflow.org/install and check which CUDA and Python versions you should be using for your specific TensorFlow version. For example, I'm using tensorflow==2.6.0, so I would use Python 3.7+ and CUDA 11.2:

# Load appropriate libraries
$ module load gcc/7.4 python/3.9.1 cudnn/8.2_cuda-11.1
$ pip install tensorflow==2.6.0 tensorflow-datasets tensorflow-addons
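# Optional sanity check (my addition, not in the official manual): confirm
# that TensorFlow detects the node's GPUs before starting a long run.
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"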

# Experiment run
$ python experiment.py

$ exit  # Exit {NODE_ID}
$ exit  # Dispose {job_id}
$ exit  # Exit SSH

Enqueueing Experiments

To enqueue jobs in the cluster, you must create an sbatch run file indicating the execution parameters for the experiment.

For instance, suppose you have written the runners/cityscapes.sh file, with the following content:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p nvidia_long
#SBATCH -J ml_train_multi_gpu
#SBATCH --exclusive

nodeset -e $SLURM_JOB_NODELIST

module load gcc/7.4 python/3.9.1 cudnn/8.2_cuda-11.1

cd ./experiments/

python3.9 cityscapes/train.py with dataset=cityscapes extra=true
python3.9 cityscapes/evaluate.py with dataset=cityscapes extra=true

Both cityscapes/train.py and cityscapes/evaluate.py jobs can be enqueued to run in the nvidia_long queue with the following command:

$ sbatch runners/cityscapes.sh
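After submission, the job can be tracked with standard Slurm commands. By default, sbatch writes the job's standard output to a slurm-{job_id}.out file in the directory it was submitted from (generic Slurm behavior; {job_id} is a placeholder):

$ squeue -u $USER                # list your pending and running jobs
$ scontrol show job {job_id}     # detailed information on a specific job
$ tail -f slurm-{job_id}.out     # follow the job's output while it runs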

gdl Queue

A Sequana queue for very deep models (with a usage cost of 2.0 UAs).

Interactive Access

salloc --nodes=1 -p gdl -J GDL-teste --exclusive

salloc: Granted job allocation 123456

# check which nodes were allocated to the job
squeue -j 123456
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456       gdl     bash usuario1  R       5:28      1 sdumont4000

# access the node
ssh sdumont4000

# load the module and run the application
[usuario1@sdumont4000 ~]$ cd /scratch/projeto/usuario1/teste-gdl
[usuario1@sdumont4000 teste-gdl]$ module load deepl/deeplearn-py3.7
[usuario1@sdumont4000 teste-gdl]$ python script.py

# close the connection to the node
[usuario1@sdumont4000 teste-gdl]$ exit

# end the interactive session and terminate the job
exit
salloc: Relinquishing job allocation 123456

Enqueueing Experiments

Create a job file experiment.srm:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p gdl
#SBATCH -J GDL-teste-script
#SBATCH --exclusive

# Display the nodes allocated to the job
echo $SLURM_JOB_NODELIST
nodeset -e $SLURM_JOB_NODELIST

cd $SLURM_SUBMIT_DIR

# Load the Deep Learning module
module load deepl/deeplearn-py3.7

# Go to the directory where the script is located
cd $SCRATCH/teste-gdl

# Run the script
python experiment.py

and then run it:

sbatch experiment.srm
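If a job was submitted by mistake or needs to be aborted, it can be removed from the queue (or killed while running) with Slurm's scancel:

scancel {job_id}      # cancel a specific job
scancel -u $USER      # cancel all of your jobs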

Reports

You can check a usage report for your project using the sreport command.

$ PROJECT=project-id
$ START=2021-06-01
$ END=2021-10-13
$ sreport -t hours cluster AccountUtilizationByUser start=$START end=$END Accounts=$PROJECT

--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-06-01T00:00:00 - 2021-10-12T23:59:59 (11577600 secs)
Use reported in TRES Hours
--------------------------------------------------------------------------------
  Cluster         Account       Login     Proper Name     Used   Energy
--------- --------------- ----------- --------------- -------- --------
  sdumont    {project_id}                                 2340        0
  sdumont    {project_id} {user_id_1}   {user_name_1}     2140        0
  sdumont    {project_id} {user_id_2}   {user_name_2}      200        0
  sdumont    {project_id} {user_id_3}   {user_name_3}        0        0

You can also get a list of all jobs executed so far:

$ sacct -S $START -E $END -X -A $PROJECT

       JobID           JobName  Partition      Account  AllocCPUS      State ExitCode
------------ ----------------- ---------- ------------ ---------- ---------- --------
1321448               hostname nvidia_dev {project_id}          1  COMPLETED      0:0
1336974      {project_id}-tes+ nvidia_lo+ {project_id}         24  COMPLETED      0:0
1337014      {project_id}-tes+ nvidia_lo+ {project_id}         24     FAILED      1:0
1337043      {project_id}-tes+ nvidia_lo+ {project_id}         24     FAILED      1:0
1337044      {project_id}-tes+ nvidia_lo+ {project_id}          1 CANCELLED+      0:0
1339157      {project_id}-tes+ nvidia_dev {project_id}         24  COMPLETED      0:0
1339590      {project_id}-tes+ nvidia_dev {project_id}         24    TIMEOUT      0:0
1339616      {project_id}-tes+ nvidia_dev {project_id}         24    TIMEOUT      0:0
1339647      {project_id}-tes+ nvidia_dev {project_id}          1 CANCELLED+      0:0

Writing Distributed TensorFlow Code

The nodes in the nvidia queues always comprise 2 or more video cards. Once allocated, each node is charged to your project's budget regardless of whether all of its cards are being used. As such, it is paramount to maximize the usage of all available hardware, avoiding unnecessary spending and freeing idle resources for other projects.

By default, TensorFlow follows a greedy policy, in which all available GPU memory (GDRAM) is allocated beforehand. This can be turned off with the following snippet:

import tensorflow as tf

# Ref.: https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth
for d in tf.config.list_physical_devices('GPU'):
  tf.config.experimental.set_memory_growth(d, True)

Only the first GPU (/gpu:0) is used by default. In order to leverage all available devices, one must declare the model under one of the distributed strategies implemented. The following snippet describes the general structure in which MirroredStrategy can be employed to fit a Model while leveraging all GPUs in a machine.

import tensorflow as tf

def appropriate_distributed_strategy():
  # Ref.: https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
  return (tf.distribute.MirroredStrategy()
          if tf.config.list_physical_devices('GPU')
          else tf.distribute.get_strategy())

def build_model():
  ...

def build_dataset():
  ...

def gpus_with_memory_growth():
  # Disable the greedy GDRAM allocation (see the previous snippet).
  for d in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(d, True)

def run(batch_size=32):
  gpus_with_memory_growth()

  dst = appropriate_distributed_strategy()

  # Build network under the distributed scope.
  # Ref.: https://www.tensorflow.org/tutorials/distribute/custom_training
  with dst.scope():
    network = build_model()

    network.compile(...)

  # Build dataset to output data "per-replica".
  # https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
  ds = build_dataset()

  samples = len(ds)
  batch_size = batch_size * dst.num_replicas_in_sync

  ds = dst.experimental_distribute_dataset(
    ds.prefetch(32 * batch_size)
      .batch(batch_size)
      .repeat()  # repeat is necessary, as far as I can tell.
  )

  # Training.
  # Ref.: https://www.tensorflow.org/tutorials/distribute/keras
  network.fit(ds, steps_per_epoch=samples // batch_size)


if __name__ == "__main__":
  run()

More information about distributed training can be found in TensorFlow's distributed training docs.