This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to HPC for Machine Learning on M3: Glossary

Key Points

What is a cluster?
  • Most HPCs are not supercomputers, but clusters composed of many computers.

  • Each computer in the HPC is called a node and provides a set of resources (such as CPUs, RAM, and GPUs). Nodes with the same flavour of resources and the same access requirements are grouped into partitions.

  • An HPC is shared with other people, so it uses a scheduler to ensure everyone gets fair access.
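As a toy illustration of how nodes relate to partitions, the Python sketch below groups nodes that share the same resources and access requirements. The `Node` class, its fields, the node names and the grouping rule are hypothetical simplifications for teaching; a real scheduler such as Slurm defines partitions in its own configuration rather than deriving them this way.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """One computer in the cluster (hypothetical toy model)."""
    name: str
    cpus: int
    ram_gb: int
    gpu_type: Optional[str]  # None for CPU-only nodes
    access: str              # who may use the node, e.g. "all-users"


def group_into_partitions(nodes: List[Node]) -> Dict[Tuple, List[str]]:
    """Group node names by resource 'flavour' plus access requirements."""
    partitions: Dict[Tuple, List[str]] = defaultdict(list)
    for node in nodes:
        flavour = (node.cpus, node.ram_gb, node.gpu_type, node.access)
        partitions[flavour].append(node.name)
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical nodes: two identical GPU nodes and one CPU-only node.
    nodes = [
        Node("node-a", 28, 256, "V100", "all-users"),
        Node("node-b", 28, 256, "V100", "all-users"),
        Node("node-c", 36, 192, None, "all-users"),
    ]
    for flavour, names in group_into_partitions(nodes).items():
        print(flavour, names)
```

Here the two identical GPU nodes end up in one group and the CPU-only node in another, mirroring how a cluster offers separate partitions for different kinds of hardware.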

Why should I use an HPC?
  • HPCs offer more storage, more RAM, more CPUs, and more (and more varied) GPUs, as well as the ability to run computations in parallel.

  • An HPC may help speed up your research, but it’s important to spend some time identifying where it can help with your research problem, or adapting the problem to suit the HPC environment.

  • Deep learning in particular is well suited to the HPC environment, which can accelerate model training.

How do I log in to a cluster?
What resources are available on the cluster?
  • Different clusters will have different resources available on them.

  • You can check the resources on a cluster from the command line, or by consulting its documentation.
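The command-line check can also be scripted. The sketch below assumes the cluster runs Slurm (which the squeue and scancel commands suggest) and calls `sinfo` with a custom output format: `%P` (partition), `%D` (node count), `%c` (CPUs per node), `%m` (memory per node, in MB) and `%G` (generic resources such as GPUs). The helper names and dictionary keys are hypothetical, and the exact output on M3 may differ, so compare it with the cluster documentation.

```python
import subprocess

# sinfo format: partition, node count, CPUs per node, memory per node (MB), GRES.
SINFO_FORMAT = "%P %D %c %m %G"


def parse_sinfo(text):
    """Turn `sinfo --noheader -o SINFO_FORMAT` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        partition, nodes, cpus, mem_mb, gres = line.split(maxsplit=4)
        rows.append({
            "partition": partition.rstrip("*"),        # '*' marks the default partition
            "nodes": int(nodes),
            "cpus_per_node": int(cpus.rstrip("+")),    # '+' can appear when values differ
            "mem_mb_per_node": int(mem_mb.rstrip("+")),
            "gres": gres,                              # e.g. a GPU spec, or "(null)"
        })
    return rows


def query_cluster():
    """Run sinfo (must be on a machine with Slurm available) and parse it."""
    result = subprocess.run(
        ["sinfo", "--noheader", "-o", SINFO_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_sinfo(result.stdout)
```

Running `query_cluster()` on a login node gives one entry per group of nodes with matching values for those fields, which is a quick way to see which partitions offer GPUs.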

How do I request resources on the cluster with smux?
  • The login node should only be used for lightweight housekeeping tasks.

  • Interactive jobs are started with the smux command on MASSIVE, and they still have to wait in the queue.

  • Interactive jobs are good for prototyping and developing code before submitting jobs.

  • Job submission scripts are an effective and reproducible way to undertake research on the cluster.

  • You can also access MASSIVE resources with desktops and JupyterLab sessions using Strudel2.

  • You can check on and manage your jobs using squeue, scancel, and show_job.
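A job submission script is a Bash script whose `#SBATCH` comment lines tell Slurm which resources to request. The Python sketch below, assuming a Slurm cluster, assembles such a script and can submit it by piping it to `sbatch --parsable` (sbatch reads the script from standard input when no file is given, and `--parsable` prints just the job ID). The helper names and default values are illustrative assumptions; the partitions and GPU request syntax actually accepted on M3 are listed in its documentation.

```python
import subprocess


def make_job_script(job_name, command, time="01:00:00", cpus=4, mem="16G",
                    gpus=0, partition=None):
    """Build the text of a Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time}",            # wall-time limit, HH:MM:SS
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
    ]
    if gpus:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")   # request GPUs only if asked
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    lines += ["", command]
    return "\n".join(lines) + "\n"


def submit(script):
    """Submit the script text via sbatch's stdin and return the job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable"], input=script,
        capture_output=True, text=True, check=True,
    )
    # --parsable prints "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]
```

For example, `make_job_script("train", "python train.py", gpus=1)` produces a script requesting one GPU. Saving that text to a file and keeping it alongside your code is what makes the request reproducible.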
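Checking on and cancelling jobs can be scripted in the same way. This sketch assumes Slurm's `squeue` with `-u` (filter by user), `--noheader`, and a custom `-o` format of `%i` (job ID), `%T` (state), `%M` (time used) and `%j` (job name), and calls `scancel` with a job ID. The helper functions are hypothetical; `show_job` is a MASSIVE-specific command and is not wrapped here.

```python
import getpass
import subprocess

# %i = job ID, %T = job state, %M = time used, %j = job name.
# The name goes last so a '|' inside it cannot break the split below.
SQUEUE_FORMAT = "%i|%T|%M|%j"


def parse_squeue(text):
    """Turn `squeue --noheader -o SQUEUE_FORMAT` output into a list of dicts."""
    jobs = []
    for line in text.strip().splitlines():
        job_id, state, elapsed, name = line.split("|", 3)
        jobs.append({"id": job_id, "state": state, "elapsed": elapsed, "name": name})
    return jobs


def my_jobs():
    """List the current user's queued and running jobs (needs Slurm on PATH)."""
    result = subprocess.run(
        ["squeue", "--noheader", "-u", getpass.getuser(), "-o", SQUEUE_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(result.stdout)


def cancel(job_id):
    """Cancel a job by its ID using scancel."""
    subprocess.run(["scancel", str(job_id)], check=True)
```

A job whose state is `PENDING` is still waiting in the queue; `RUNNING` jobs have been allocated resources.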

Introduction
  • First key point. Brief Answer to questions. (FIXME)

Glossary

FIXME