
  • CPU Memory Best Practices

    Efficient CPU memory usage helps ensure that the shared cluster resources remain available for all users. Requesting too much memory can lead to longer queue times (for you and others), while requesting too little may cause jobs to fail.
    Aim to request an appropriate amount of memory for all of your jobs.
    • Target utilization: ~80–90% of requested memory
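    As a rough guide, a measured peak can be turned into a request that lands in that utilization band. A minimal sketch (the function name and rounding policy are illustrative, not a site-provided tool):

    ```python
    import math

    def right_size_mem(peak_gb: float, target_util: float = 0.85) -> int:
        """Given the peak memory a test job actually used (in GB), return a
        --mem request in whole GB so the peak sits near target_util of it."""
        return math.ceil(peak_gb / target_util)

    # A probe job that peaked at 3.2 GB -> request 4 GB (80% utilization)
    print(right_size_mem(3.2))
    ```

    The result maps directly onto the submission flag, e.g. `sbatch --mem=4G`.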
    If you are running many similar jobs (e.g., job arrays, parameter sweeps, workflows processing many different samples, etc.), it is especially important to estimate memory needs before scaling up.
    Why this matters
    Submitting hundreds or thousands of jobs with overestimated memory can:
    • Tie up memory that other users' jobs could otherwise be using
    • Increase queue times for you and for others, since the scheduler reserves memory that then goes unused

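    Before scaling up, a common pattern is to probe a handful of representative samples with a deliberately generous request, then read back what was actually used. A sketch of such a probe script (the script name, sample count, and limits are placeholders, not site defaults):

    ```shell
    #!/bin/bash
    #SBATCH --job-name=mem-probe
    #SBATCH --mem=8G              # deliberately generous for the probe run
    #SBATCH --time=00:30:00
    #SBATCH --array=0-4           # a few representative samples, not the full sweep

    # Hypothetical per-sample workload
    python process_sample.py "${SLURM_ARRAY_TASK_ID}"
    ```

    After the probe completes, `sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,Elapsed` (or `seff <jobid>`, where installed) reports each task's peak resident memory; base the full sweep's `--mem` on the largest observed MaxRSS plus some headroom.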
  • GPU Best Practices

    Core Principles
    • Measure before you scale. Always take a short, single‑GPU baseline and record simple metrics.
    • Right‑size, don’t over‑ask. Request only the GPUs/CPUs/RAM and walltime your measurements justify.
    • Keep GPUs busy. If utilization is low, fix input/data issues before adding more GPUs.
    • Short interactive, long batch. Use OOD for quick experiments; move long work to SLURM.
    • Be a good citizen. Release idle sessions, clean up scratch, and prefer storage patterns that reduce system load.
    Right‑Sizing in 5 Steps
    1. Baseline (≤5 minutes): Run a tiny slice on 1 GPU. Note:
       • Throughput (samples/s or tokens/s)
       • GPU utilization and memory usage
       • Any stalls from CPU or I/O
    2. Find the knee: Increase batch size and enable mixed precision if supported.
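    The baseline step can be sketched as a simple throughput timer; in real use `step_fn` would be one forward/backward pass of your model on a single GPU (the names and numbers here are illustrative):

    ```python
    import time

    def measure_throughput(step_fn, n_steps: int, batch_size: int) -> float:
        """Time n_steps calls of a training/inference step; return samples/s."""
        step_fn()  # warm-up call, excluded from timing
        start = time.perf_counter()
        for _ in range(n_steps):
            step_fn()
        elapsed = time.perf_counter() - start
        return n_steps * batch_size / elapsed

    # Stand-in step; in practice: one batch through the model.
    rate = measure_throughput(lambda: time.sleep(0.001), n_steps=20, batch_size=32)
    print(f"{rate:.0f} samples/s")
    ```

    Record this alongside GPU utilization and memory while the slice runs (e.g. `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5`), then repeat at larger batch sizes to find the knee where throughput stops improving.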