How to reduce VRAM usage? #5

Open
ymcki opened this issue Oct 25, 2023 · 1 comment
ymcki commented Oct 25, 2023

I ran into an OOM crash on my 4xA100 80GB machine after running for about 24 hours on a VCF with 5.4M records. How can I reduce VRAM usage to prevent the crash?

2023-10-25 01:41:26.964429: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] total_region_allocated_bytes_: 79465996800 memory_limit_: 79465996800 available bytes: 0 curr_region_allocation_bytes_: 158931993600
2023-10-25 01:41:26.964440: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit:                 79465996800
InUse:                 52721177600
MaxInUse:              72756299520
NumAllocs:                19296031
MaxAllocSize:           6509670400

2023-10-25 01:41:26.964455: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ***********************************_______**********_______*****************_______*********________
2023-10-25 01:41:26.964503: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at spacetobatch_op.cc:219 : Resource exhausted: OOM when allocating tensor with shape[42100,1224,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[progress bar output trimmed]
Traceback (most recent call last):
  File "/nfs/home/abc/miniconda3/envs/py38gpu/bin/cis-vcf", line 8, in <module>
    sys.exit(vcf())
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/console.py", line 46, in vcf
    annotator.annotate_vcf(
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 440, in annotate_vcf
    jobs = self.run_jobs(jobs)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 299, in run_jobs
    self._run_batches(jobs_to_run)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 324, in _run_batches
    preds = self._model.predict(x_ref + x_var)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/model.py", line 29, in predict
    preds = self._model(padded_batch)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1605, in __call__
    return self._call_impl(args, kwargs)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[4210,32,1,12038] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model_1/conv1d_2/conv1d_1-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_pruned_22774]

Function call stack:
pruned
@YStrauch (Owner) commented

Generally, the parameter that manages VRAM usage is --batch, which defaults to 10 (in MB), so decreasing it should decrease VRAM usage.
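
For example, something along these lines; everything except --batch is a stand-in for whatever you already pass to cis-vcf:

```bash
# Same cis-vcf command as before, just with a smaller maximum batch size
# (the default is 10 MB); try halving it and adjust from there.
cis-vcf --batch 5 ...   # "..." stands in for your existing arguments
```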

However, if the program runs fine for 24 hours and only crashes after that long, I can see three possible causes:

  1. Another process used the GPU. In that case, well, don't start another GPU process.
  2. There's an unlucky batch of variants that loads your GPU to the max (the --batch value only sets the maximum batch size; the actual memory used depends on the REF/ALT annotations and the batches themselves). In that case, decreasing the batch size will solve the issue.
  3. VRAM usage increases steadily over time due to a memory leak. That would be terrible, and it is something you can spot by monitoring VRAM over time: if it constantly increases over the 24 hours until the crash, that's bad. It's not something I can fix; in that case I recommend splitting your variants into smaller files so that the GPU is reset periodically (see the sketch below this list).
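
For point 3, a rough sketch of both the monitoring and the splitting, assuming an uncompressed VCF, standard coreutils, and nvidia-smi on the path; the input name and chunk size are placeholders:

```bash
# Watch VRAM over time (prints GPU memory use every 60 s); if the numbers
# climb steadily over the run, that points to the memory-leak case.
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 60

# Split a plain-text VCF into ~500k-record chunks, repeating the header in
# each chunk, so every chunk can be annotated by a separate cis-vcf run.
grep '^#'    input.vcf > header.txt
grep -v '^#' input.vcf | split -l 500000 - body_
for part in body_*; do
  cat header.txt "$part" > "chunk_${part#body_}.vcf"
  rm "$part"
done
```

Running cis-vcf once per chunk, as separate invocations, means the process (and with it the GPU memory) is released between chunks.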

Let's hope that decreasing the batch size already helps; do let me know your progress!
