I ran into an OOM crash on my 4xA100 80GB machine after running for about 24 hours on a VCF with 5.4M records. How can I reduce VRAM usage to prevent the crash?
2023-10-25 01:41:26.964429: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] total_region_allocated_bytes_: 79465996800 memory_limit_: 79465996800 available bytes: 0 curr_region_allocation_bytes_: 158931993600
2023-10-25 01:41:26.964440: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit: 79465996800
InUse: 52721177600
MaxInUse: 72756299520
NumAllocs: 19296031
MaxAllocSize: 6509670400
2023-10-25 01:41:26.964455: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ***********************************_______**********_______*****************_______*********________
2023-10-25 01:41:26.964503: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at spacetobatch_op.cc:219 : Resource exhausted: OOM when allocating tensor with shape[42100,1224,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
#############------------------------------------------------------------------- (16%)
Traceback (most recent call last):
File "/nfs/home/abc/miniconda3/envs/py38gpu/bin/cis-vcf", line 8, in <module>
sys.exit(vcf())
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/console.py", line 46, in vcf
annotator.annotate_vcf(
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 440, in annotate_vcf
jobs = self.run_jobs(jobs)
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 299, in run_jobs
self._run_batches(jobs_to_run)
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 324, in _run_batches
preds = self._model.predict(x_ref + x_var)
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/model.py", line 29, in predict
preds = self._model(padded_batch)
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1605, in __call__
return self._call_impl(args, kwargs)
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
outputs = execute.execute(
File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4210,32,1,12038] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model_1/conv1d_2/conv1d_1-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_pruned_22774]
Function call stack:
pruned
Generally, the parameter that manages VRAM usage is --batch, which defaults to 10 (in MB). Decreasing it should decrease VRAM usage.
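For illustration, re-running with a lower batch limit would look roughly like this; the `--batch` flag is the only point here, and `<your usual arguments>` is a placeholder for whatever input/output options you already pass:

```bash
# Illustrative sketch only: the same cis-vcf call as before, just with a lower
# --batch limit (the default is 10). Replace <your usual arguments> with the
# exact arguments from your original run.
cis-vcf <your usual arguments> --batch 5
```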
However, if the program runs fine for 24 hours and only crashes after such a long time, I can think of three cases:
1. Another process used the GPU. In that case, well, don't start another GPU process.
2. An unlucky batch of variants loads your GPU to the max (the --batch value defines the maximum batch size; the actual memory footprint depends on the REF/ALT annotations and the batches themselves). In that case, decreasing the batch size will solve the issue.
3. VRAM usage increases steadily over time due to a memory leak. That would be terrible, and it is something you can see when monitoring VRAM over time: if it constantly increases over the 24 hours until the crash, that's bad. It's not something I can fix; in that case, I recommend splitting your variants into smaller files so that the GPU is reset periodically (see the sketch below).
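If it does turn out to be case 3, here is a rough sketch of one way to split the VCF with standard shell tools; the file names and the 1M-record chunk size are placeholders, and the actual cis-vcf call is left as a comment because it depends on your setup:

```bash
# Rough sketch, not an official workflow: split a large VCF into fixed-size
# chunks so each chunk is annotated in a fresh process and GPU memory is
# released between runs.
grep '^#'  input.vcf  > header.txt                  # keep all VCF header lines
grep -v '^#' input.vcf | split -l 1000000 - body_   # split records into chunks
for part in body_*; do
  cat header.txt "$part" > "${part}.vcf"
  # annotate each chunk separately, e.g.:
  # cis-vcf <your usual arguments> "${part}.vcf"
done
```

To confirm whether VRAM really creeps up, something like `nvidia-smi --query-gpu=memory.used --format=csv -l 60` logs GPU memory use once per minute, so you can see whether it climbs steadily over the run.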
Let's hope that decreasing the batch size already helps; do let me know your progress!