August 25, 2020
Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. Slurm, an open source, highly scalable job-scheduling system for clusters, is commonly used in both industry and academia. At Facebook AI Research (FAIR), we use a Slurm-administrated cluster with thousands of GPUs on which our researchers train neural networks. Submitit has simplified the task of scheduling an experiment on the cluster and collecting the results, logs, etc. We have released Submitit to help other researchers run their experiments on a Slurm cluster.
Submitit exposes the same basic Executor API as the concurrent.futures module of the Python standard library, along with a few additional features.
import submitit

def add(a, b):
    return a + b

# ask for resources
executor = submitit.AutoExecutor(folder="my_shared_folder")
executor.update_parameters(gpus_per_node=2)

# submit to the cluster
job = executor.submit(add, 5, 7)  # will compute add(5, 7)

# waits for completion and returns output
output = job.result()
# 5 + 7 = 12... your addition was computed in the cluster
assert output == 12
This Executor interface is similar to the one in the dask.distributed package, but it operates at a lower level: it gives direct access to logs and errors, and it can handle checkpointing when a job is preempted or times out (an advanced feature). Because the API is shared, code can be moved with little change between running on a Slurm cluster with submitit.AutoExecutor and running locally with multiprocessing (concurrent.futures.ProcessPoolExecutor) or multithreading (concurrent.futures.ThreadPoolExecutor). Submitit can also be configured to run locally for testing, and its plugin system leaves the door open to supporting new clusters in the future.
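To illustrate the shared API, here is a minimal sketch that relies only on the submit() and result() calls shown earlier. The helper name run_all and the sample inputs are hypothetical and chosen purely for illustration. The sketch runs on a standard-library thread pool; the assumption is that a submitit.AutoExecutor could be passed to the same function unchanged, since only submit() and result() are used.

```python
from concurrent.futures import ThreadPoolExecutor


def add(a, b):
    return a + b


def run_all(executor, pairs):
    """Submit add(a, b) for every pair, then collect the results in order.

    Only executor.submit(...) and job.result() are used, which is the part
    of the API that the article describes as shared across executors.
    """
    jobs = [executor.submit(add, a, b) for a, b in pairs]
    return [job.result() for job in jobs]


# Local run with threads. On a Slurm cluster, an executor created with
# submitit.AutoExecutor(folder=...) is assumed to be a drop-in replacement.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = run_all(pool, [(5, 7), (1, 2), (10, -3)])

print(results)
```

Keeping the executor as a function argument, rather than creating it inside the function, is what lets the same code run on a laptop or on the cluster.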
Submitit allows researchers to easily switch from small-scale experimentation on their own machine to large-scale experiments on the cluster. They can work in a language they are familiar with (Python), and more easily analyze the results and schedule more experiments. The open source version of Submitit will enable our researchers to more easily release and share open source code. If no cluster is found, Submitit automatically falls back to running experiments locally, which allows third parties to clone the open source code of a FAIR paper and start running small experiments immediately.
Submitit is directly integrated into several of our open source Python projects, including:
Nevergrad: A derivative-free optimization platform that can be used to optimize hyperparameters of neural network training. Nevergrad's main optimization method accepts an optional Executor parameter, so that the optimization can run in parallel, either locally or on a cluster, using concurrent.futures, submitit, or dask.distributed.
Hydra: A framework for elegantly configuring complex applications. This framework supports sweeping on the application parameters (including Nevergrad hyperparameter tuning), and the sweeps can now run on Slurm, thanks to Hydra’s submitit plugin.
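The optional Executor parameter mentioned for Nevergrad can be pictured with a small, self-contained sketch. This is not Nevergrad's code or API: random_search and sphere are hypothetical names, and plain random search stands in for a real derivative-free optimizer. It only shows how an optimization loop might evaluate candidates sequentially when no executor is given and in parallel when one is.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sphere(x):
    """Toy objective (hypothetical): sum of squares, minimal at the origin."""
    return sum(v * v for v in x)


def random_search(func, dim, budget, executor=None, seed=0):
    """Evaluate `budget` random candidates and return (best_point, best_loss).

    If `executor` is None, evaluations run sequentially. Otherwise each
    evaluation is submitted through executor.submit(...) and gathered with
    job.result(), so any executor offering those two calls can be used.
    """
    rng = random.Random(seed)
    candidates = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(budget)]
    if executor is None:
        losses = [func(c) for c in candidates]
    else:
        jobs = [executor.submit(func, c) for c in candidates]
        losses = [job.result() for job in jobs]
    best = min(range(budget), key=losses.__getitem__)
    return candidates[best], losses[best]


# Same seed, so both runs evaluate the same candidates.
sequential = random_search(sphere, dim=2, budget=50)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = random_search(sphere, dim=2, budget=50, executor=pool)
```

Because the candidates are drawn before any evaluation starts, the result does not depend on which executor is used, only on the seed.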