This post introduces and details the process of accessing and using the Santos Dumont supercomputer, managed by LNCC. It is heavily based on the official support manual provided by LNCC, as well as on my personal experience in the program (hence, it may not perfectly represent all cases). Its goal is to present the information in a more directed manner for users who share my profile: Deep Learning researchers who prefer the TensorFlow framework and are familiar with Docker.
SDumont
Santos Dumont is a Brazilian supercomputer located in the city of Petrópolis, Rio de Janeiro State, Brazil. A general description of its specs and processing power can be found at sdumont.lncc.br/machine.
Brazilian scientists and research groups are encouraged to apply with a research project proposal, which will grant them usage of the available infrastructure. Furthermore, its processing power can also be used for educational purposes. To this end, educators must first present a course plan and specify usage details.
Project Proposal
The engagement process starts at this website. It states that proposals will be evaluated up to November 27th, 2021. To be considered, you must fill in this application and submit it through the JEMS system.
First Access and Setup
Once a proposal is accepted, a welcoming e-mail is sent from jems@sbc.org.br to all authors. The e-mail contains details and comments from reviewers regarding the project proposal, as well as grading score per evaluation category.
User and Project Registration
The welcoming e-mail will ask the project members to fill in two forms:
- Project Coordinator Form: should be filled in by the project coordinator. It formalizes the project scope (by repeating what was written in the proposal or by updating it with justifiable notice) and lists all members involved in the project. link
- Project User Form: should be filled in by each member listed in the Project Coordinator Form (one form per member). link
Identification information (such as “Registro Geral”, or R.G.) must be provided in both forms, which must be printed and signed (or digitally signed). The forms, as well as the scans of ID documents for each project member and coordinator, must be e-mailed to helpdesk-sdumont@lncc.br and sdumont@lncc.br. There is a timeframe of 30 days to complete this stage of registration.
Once it is done, you will receive the following confirmation e-mail:
FROM: helpdesk-sdumont@lncc.br
SUBJECT: Re: SDumont registration forms - ID {ID_NUMBER}
Dear {NAME},
We have opened the tickets below for the creation of the project and user accounts:
Ticket - Opening of SDumont project account - {PROJECT_CODE} - ID {PROJECT_ID}
Ticket - Opening of SDumont user account - {NAME} - ID {USER_ID}
Ticket - Opening of SDumont user account - {NAME} - ID {USER_ID}
...
Wait for a day. Each member listed by the Coordinator will receive their own registration confirmation e-mail. You will find your name and ID detailed in the confirmation e-mail’s subject. Your temporary password, however, is not stored in the e-mail (or anywhere else). You must call LNCC on their provided phone number and tell them you have just received the confirmation e-mail. Inform your name and ID number and they will verbally inform you of your password.
Configure and open the VPN tunnel (using the instructions in the e-mail and the credentials provided on the call). I use Linux and opted for the graphical VPN interface, so the following packages are necessary:
sudo apt-get install vpnc network-manager-vpnc network-manager-vpnc-gnome
Once the tunnel is open, you can ssh into the butler host:
ssh USERNAME@login.sdumont.lncc.br
password: ******
Use your temporary password. Once you are logged in, you will be asked to change it. When this is done, you will be disconnected from the VPN. Update its settings to reflect the new password, if necessary, and open the tunnel once again. SSH into the butler once more and you will now be connected to SDumont!
Resources
Santos Dumont uses the Slurm Workload Manager to orchestrate its nodes. Two important aspects to understand about this manager are the distributed storage system employed and its job scheduling mechanism.
Storage
Two storage partitions are available [ref]:
Name | Capacity | Location | Description |
---|---|---|---|
Scratch | 25 TB* | /scratch/{PROJECT}/{USER} | input, transient and output data |
Home | 5 TB | /prj/{PROJECT}/{USER} | source code, libraries |
* Scratch is 50 TB in Premium tier projects.
Files in scratch are erased after 60 days without being updated, and should be backed up in Home.
Nodes cannot access files in Home, so every source code file, dataset and associated archive must be copied to the scratch partition before usage. For example, say you have the file `job.py` inside the `experiments` folder. The entire `experiments` folder can be synchronized with its counterpart in the scratch partition using the following command:
rsync -rUv --delete experiments/ $SCRATCH/experiments/
Once the experiments have run and their logs have been produced, we can copy them back to the home partition with:
rsync -rUv $SCRATCH/logs logs
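Since files in scratch are purged after 60 days without modification, it may also be useful to list which files are approaching that limit before backing them up. A small sketch, assuming the login node provides GNU find and the same $SCRATCH variable used above:
# List scratch files untouched for more than 50 days (candidates for backup before the 60-day purge).
find $SCRATCH -type f -mtime +50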
GPUs
Below, I list the most interesting GPU queues for Machine Learning, as well as their specs.
Name | Wall time (hours) | CPUs | Cores | Memory (GB) | GPUs | Cost (UA) |
---|---|---|---|---|---|---|
nvidia_dev | 0.33 | 2 | 24 | 64 | 2 K40 | 1.5 |
nvidia_small | 1 | 2 | 24 | 64 | 2 K40 | 1.5 |
nvidia_long | 744 (31 days) | 2 | 24 | 64 | 2 K40 | 1.5 |
nvidia_scal | 18 | 102 | 1224 | 6528 | 102 K40 | 1.5 |
nvidia | 48 | 42 | 504 | 2688 | 42 K40 | 1.5 |
sequana_gpu | 96 | 2 | 48 | 384 | 4 V100 | 1.5 |
sequana_gpu_dev | 0.33 | 2 | 48 | 384 | 4 V100 | 1.5 |
sequana_gpu_long | 744 (31 days) | 2 | 48 | 384 | 4 V100 | 1.5 |
gdl | 48 | 2 | 40 | 384 | 8 V100 | 2.0 |
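The queues above (and their current limits) can also be inspected directly with Slurm's `sinfo`; the format string below is only one possible selection of columns:
# Partition, time limit, node count and generic resources (GPUs) for each queue.
$ sinfo -o "%P %l %D %G"
# Zoom into a single queue, e.g. nvidia_small, including CPUs and memory per node.
$ sinfo -p nvidia_small -o "%P %l %D %c %m %G"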
Submitting Jobs
Jobs can be submitted in two ways: interactively or queued for execution. During submission, you must inform the queue in which the job will be placed.
For the examples below, assume you have a Python script named `experiment.py` that can run and evaluate your experiment.
NVIDIA Queues
These queues should be used for regular experiments (two GPUs with 12 GB each are available per node).
Interactive Access
These nodes can be accessed interactively with the following commands:
$ salloc --nodes=1 -p {queue} -J {name} --exclusive
salloc: Pending job allocation {job_id}
salloc: job {job_id} queued and waiting for resources
salloc: job {job_id} has been allocated resources
salloc: Granted job allocation {job_id}
$ squeue -j {job_id}
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
{job_id} {queue} {name} {user} R 0:21 1 {node_id}
$ ssh {node_id}
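Once inside the allocated node, you can optionally confirm that the GPUs are visible before setting up the environment (a quick sanity check, not a required step):
# List the GPUs visible inside the node.
$ nvidia-smi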
Now go to tensorflow.org/install and check which CUDA and Python versions you should be using for your specific TensorFlow version. For example, I am using `tensorflow==2.6.0`, so I would use Python ≥ 3.7 and CUDA ≥ 11.2:
# Load appropriate libraries
$ module load gcc/7.4 python/3.9.1 cudnn/8.2_cuda-11.1
$ pip install tensorflow==2.6.0 tensorflow-datasets tensorflow-addons
# Experiment run
$ python experiment.py
$ exit # Exit {NODE_ID}
$ exit # Dispose {job_id}
$ exit # Exit SSH
Libraries installed with pip are kept in your user area, so you should be able to skip the `pip install` step in the next experiments.
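If you prefer an isolated environment (or if the user-level installation turns out not to be visible from the compute nodes), a virtual environment can be created directly in the scratch partition instead. A sketch, assuming the same modules as above and the `$SCRATCH` variable; the `tf26` name is arbitrary:
# Create the environment once, inside scratch, so the compute nodes can see it.
$ module load gcc/7.4 python/3.9.1 cudnn/8.2_cuda-11.1
$ python3.9 -m venv $SCRATCH/envs/tf26
$ source $SCRATCH/envs/tf26/bin/activate
# Install the dependencies once and reuse the environment in later jobs.
$ pip install tensorflow==2.6.0 tensorflow-datasets tensorflow-addons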
Enqueueing Experiments
To enqueue jobs in the cluster, you must create an sbatch run file indicating the execution parameters for the experiment. For instance, suppose you have written the `runners/cityscapes.sh` file, with the following content:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p nvidia_long
#SBATCH -J ml_train_multi_gpu
#SBATCH --exclusive
# Display the nodes allocated to the job
nodeset -e $SLURM_JOB_NODELIST
# Load the required libraries
module load gcc/7.4 python/3.9.1 cudnn/8.2_cuda-11.1
# Run the training and evaluation scripts
cd ./experiments/
python3.9 cityscapes/train.py with dataset=cityscapes extra=true
python3.9 cityscapes/evaluate.py with dataset=cityscapes extra=true
Both the `cityscapes/train.py` and `cityscapes/evaluate.py` jobs can be enqueued to run in the `nvidia_long` queue with the following command:
$ sbatch runners/cityscapes.sh
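After submission, the job can be monitored (and cancelled, if necessary) with standard Slurm commands, where `{job_id}` is the identifier printed by `sbatch`:
# List your pending and running jobs.
$ squeue -u $USER
# Inspect the state, partition and allocated nodes of a specific job.
$ scontrol show job {job_id}
# Cancel a job that is no longer needed.
$ scancel {job_id}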
gdl Queue
This is a Sequana queue, intended for very deep models (with a usage cost of 2.0 UAs).
Interactive Access
salloc --nodes=1 -p gdl -J GDL-teste --exclusive
salloc: Granted job allocation 123456
# check which nodes were allocated to the job
squeue -j 123456
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 gdl bash usuario1 R 5:28 1 sdumont4000
# access the node
ssh sdumont4000
# load the module and run the application
[usuario1@sdumont4000 ~]$ cd /scratch/projeto/usuario1/teste-gdl
[usuario1@sdumont4000 teste-gdl]$ module load deepl/deeplearn-py3.7
[usuario1@sdumont4000 teste-gdl]$ python script.py
# close the connection to the node
[usuario1@sdumont4000 teste-gdl]$ exit
# end the interactive session and terminate the job
exit
salloc: Relinquishing job allocation 123456
Enqueueing Experiments
Create a job file `experiment.srm`:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p gdl
#SBATCH -J GDL-teste-script
#SBATCH --exclusive
# Display the nodes allocated to the job
echo $SLURM_JOB_NODELIST
nodeset -e $SLURM_JOB_NODELIST
cd $SLURM_SUBMIT_DIR
# Load the Deep Learning module
module load deepl/deeplearn-py3.7
# Go to the directory where the script is located
cd $SCRATCH/teste-gdl
# Run the script
python experiment.py
and then run it:
sbatch experiment.srm
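Unless an `--output` option is added to the batch script, Slurm writes the job's standard output and error to a `slurm-{job_id}.out` file in the submission directory, which you can follow while the job runs:
# Follow the job's output as it is produced.
$ tail -f slurm-{job_id}.out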
Reports
You can check a usage report for your project using the `sreport` command.
$ PROJECT=project-id
$ START=2021-06-01
$ END=2021-10-13
$ sreport -t hours cluster AccountUtilizationByUser start=$START end=$END Accounts=$PROJECT
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2021-06-01T00:00:00 - 2021-10-12T23:59:59 (11577600 secs)
Use reported in TRES Hours
--------------------------------------------------------------------------------
Cluster Account Login Proper Name Used Energy
--------- --------------- ----------- --------------- -------- --------
sdumont {project_id} 2340 0
sdumont {project_id} {user_id_1} {user_name_1} 2140 0
sdumont {project_id} {user_id_2} {user_name_2} 200 0
sdumont {project_id} {user_id_3} {user_name_3} 0 0
You can also get a list of all jobs executed so far:
$ sacct -S $START -E $END -X -A $PROJECT
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ----------------- ---------- ------------ ---------- ---------- --------
1321448 hostname nvidia_dev {project_id} 1 COMPLETED 0:0
1336974 {project_id}-tes+ nvidia_lo+ {project_id} 24 COMPLETED 0:0
1337014 {project_id}-tes+ nvidia_lo+ {project_id} 24 FAILED 1:0
1337043 {project_id}-tes+ nvidia_lo+ {project_id} 24 FAILED 1:0
1337044 {project_id}-tes+ nvidia_lo+ {project_id} 1 CANCELLED+ 0:0
1339157 {project_id}-tes+ nvidia_dev {project_id} 24 COMPLETED 0:0
1339590 {project_id}-tes+ nvidia_dev {project_id} 24 TIMEOUT 0:0
1339616 {project_id}-tes+ nvidia_dev {project_id} 24 TIMEOUT 0:0
1339647 {project_id}-tes+ nvidia_dev {project_id} 1 CANCELLED+ 0:0
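For a single job, `sacct` can also report more detailed accounting fields; the format list below is just one possible selection:
# Elapsed time, final state and peak memory usage of a specific job.
$ sacct -j {job_id} --format=JobID,JobName,Partition,Elapsed,State,MaxRSS,AllocTRES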
Writing Distributed TensorFlow Code
The nodes in the `nvidia` queues are always associated with two or more video cards. Once allocated, each node is charged to your project's budget regardless of whether all of its cards are being used. As such, it is paramount to maximize the usage of all available hardware, avoiding unnecessary spending and freeing idle resources for other projects.
TensorFlow follows a greedy policy, in which all available GPU memory is allocated beforehand. This can be turned off with the following statement:
import tensorflow as tf

# Ref.: https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth
for d in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(d, True)
Only the first GPU (`/gpu:0`) is used by default. In order to leverage all available devices, one must declare the model using one of the distributed strategies implemented. The following snippet describes the general structure in which the `MirroredStrategy` can be employed to fit a Model while leveraging all GPUs in a machine.
import tensorflow as tf


def gpus_with_memory_growth():
    # Avoid allocating all of the GPU memory upfront (see the previous snippet).
    for d in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(d, True)


def appropriate_distributed_strategy():
    # Ref.: https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
    return (tf.distribute.MirroredStrategy()
            if tf.config.list_physical_devices('GPU')
            else tf.distribute.get_strategy())


def build_model():
    ...


def build_dataset():
    ...


def run(batch_size=32):
    gpus_with_memory_growth()
    dst = appropriate_distributed_strategy()

    # Build the network under the distributed scope.
    # Ref.: https://www.tensorflow.org/tutorials/distribute/custom_training
    with dst.scope():
        network = build_model()
        network.compile(...)

    # Build the dataset to output data "per-replica".
    # https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
    ds = build_dataset()
    samples = len(ds)
    batch_size = batch_size * dst.num_replicas_in_sync
    ds = dst.experimental_distribute_dataset(
        ds.prefetch(32 * batch_size)
          .batch(batch_size)
          .repeat()  # repeat is necessary, as far as I can tell.
    )

    # Training.
    # Ref.: https://www.tensorflow.org/tutorials/distribute/keras
    network.fit(ds, steps_per_epoch=samples // batch_size)


if __name__ == "__main__":
    run()
More information on distributed training can be found in TensorFlow's distributed training docs.