Optimizing PyTorch

Created: November 26, 2019 / Updated: July 26, 2021 / Status: in progress / 2 min read (~290 words)

  • Multi-GPU usage
  • Training multiple models on a single GPU in multiple processes
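  • A minimal multi-GPU sketch, assuming a toy model (not from the article): nn.DataParallel replicates the model on every visible GPU and splits each batch across them. For the second case, the simplest option is to launch the training script once per model, all pointing at the same GPU.

    import torch
    import torch.nn as nn

    # Toy model used only for illustration
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # replicate on all visible GPUs
    model = model.cuda()

    x = torch.randn(64, 128).cuda()  # the batch is split across GPUs automatically
    out = model(x)                   # outputs are gathered back on the default GPU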

  • Use cases
    • Code is not using 100% of the CPU/RAM
      • Increase batch size
      • Parallelize data loading with GPU computation (see the DataLoader sketch after this list)
    • Data does not fit in GPU RAM
      • Reduce batch size
    • GPU usage is 100% yet there is no progress
        • This might be due to multithreading causing CPU/GPU thrashing
          • In my experiments I've seen 100% GPU usage; I'm not completely sure about the CPU usage
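  • A minimal data loading sketch (the dataset and sizes are made up): DataLoader worker processes prepare batches in the background so the GPU is not left waiting for data.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Synthetic dataset used only for illustration
    dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

    loader = DataLoader(
        dataset,
        batch_size=256,   # raise this while GPU memory allows
        num_workers=4,    # prepare batches in worker processes, in parallel with GPU work
        pin_memory=True,  # page-locked host memory enables faster copies to the GPU
    )

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        # ... forward/backward pass ...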

  • Data loader
    • Efficient
    • In a separate thread/non-blocking
  • Data transfer between CPU and GPU
    • Minimal
  • Batch size
    • Take as much GPU RAM as possible
  • GPU usage is near 100% (GPU should be your bottleneck)
  • Verify that GPU memory is freed (see the sketch after this list)
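  • A minimal sketch of the data transfer and memory checks above (tensor names are hypothetical): use pinned memory and non-blocking copies for the transfers you cannot avoid, and query the CUDA allocator to confirm memory is released.

    import torch

    device = torch.device("cuda")

    batch = torch.randn(256, 128).pin_memory()       # page-locked memory allows async copies
    batch_gpu = batch.to(device, non_blocking=True)  # overlaps the copy with GPU work

    print(torch.cuda.memory_allocated(device))  # bytes held by live tensors
    print(torch.cuda.memory_reserved(device))   # bytes held by the caching allocator

    del batch_gpu                               # drop the last reference; allocated bytes fall
    print(torch.cuda.memory_allocated(device))
    torch.cuda.empty_cache()                    # return cached blocks to the driver
    print(torch.cuda.memory_reserved(device))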

  • Run your script with Python's profiler to determine which parts are CPU expensive (a pstats sketch for inspecting the output follows this list)
    python -m cProfile -o my_profile.prof train.py
  • Run your script with nvprof to determine what is being done on the GPU
    nvprof -o my_profile.nvvp python train.py
  • Free up the memory you used with del (e.g., del my_tensors)
  • If running PyTorch in multiple processes, set OMP_NUM_THREADS to a low number: PyTorch uses multithreaded BLAS for linear algebra on the CPU, and if the variable is not set, each process will try to use all of the cores and they will contend with each other (see the sketch below)
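  • A minimal sketch for inspecting the cProfile output above (my_profile.prof), sorting by cumulative time to find the most expensive call paths.

    import pstats

    stats = pstats.Stats("my_profile.prof")
    stats.sort_stats("cumulative").print_stats(20)  # show the 20 most expensive entries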
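  • A minimal sketch for capping CPU threads per process. Setting the environment variable in the shell (OMP_NUM_THREADS=1 python train.py) is the more reliable option, since OpenMP reads it at startup; torch.set_num_threads covers PyTorch's own intra-op thread pool.

    import torch

    torch.set_num_threads(1)  # limit intra-op parallelism of PyTorch's CPU kernels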