How to reduce VRAM usage? #5

Open
ymcki opened this issue Oct 25, 2023 · 1 comment
ymcki commented Oct 25, 2023

I ran into an OOM crash on my 4xA100 80GB machine after running for about 24 hours on a VCF with 5.4M records. How can I reduce VRAM usage to prevent the crash?

2023-10-25 01:41:26.964429: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] total_region_allocated_bytes_: 79465996800 memory_limit_: 79465996800 available bytes: 0 curr_region_allocation_bytes_: 158931993600
2023-10-25 01:41:26.964440: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit:                 79465996800
InUse:                 52721177600
MaxInUse:              72756299520
NumAllocs:                19296031
MaxAllocSize:           6509670400

2023-10-25 01:41:26.964455: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ***********************************_______**********_______*****************_______*********________
2023-10-25 01:41:26.964503: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at spacetobatch_op.cc:219 : Resource exhausted: OOM when allocating tensor with shape[42100,1224,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[progress bar output trimmed]
Traceback (most recent call last):
  File "/nfs/home/abc/miniconda3/envs/py38gpu/bin/cis-vcf", line 8, in <module>
    sys.exit(vcf())
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/console.py", line 46, in vcf
    annotator.annotate_vcf(
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 440, in annotate_vcf
    jobs = self.run_jobs(jobs)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 299, in run_jobs
    self._run_batches(jobs_to_run)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/annotation.py", line 324, in _run_batches
    preds = self._model.predict(x_ref + x_var)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/cispliceai/model.py", line 29, in predict
    preds = self._model(padded_batch)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1605, in __call__
    return self._call_impl(args, kwargs)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/nfs/home/abc/miniconda3/envs/py38gpu/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[4210,32,1,12038] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model_1/conv1d_2/conv1d_1-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_pruned_22774]

Function call stack:
pruned
@YStrauch (Owner) commented

Generally, the parameter that manages VRAM usage is --batch, which defaults to 10 (in MB), so decreasing it should decrease VRAM usage.
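
For example, something along these lines; everything except --batch is a stand-in for whatever you already pass to cis-vcf:

```bash
# Same cis-vcf command as before, just with a smaller maximum batch size
# (the default is 10 MB); try halving it and adjust from there.
cis-vcf --batch 5 ...   # "..." stands in for your existing arguments
```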

However, if the program runs fine for 24 hours and only crashes after that long, I can see three possible causes:

  1. Another process used the GPU. In that case, well, don't start another GPU process.
  2. There's an unlucky batch of variants that loads your GPU to the max (the --batch value only sets the maximum batch size; the actual memory used depends on the REF/ALT annotations and the batches themselves). In that case, decreasing the batch size will solve the issue.
  3. VRAM usage increases steadily over time due to a memory leak. That would be terrible, and it is something you can spot by monitoring VRAM over time: if it constantly increases over the 24 hours until the crash, that's bad. It's not something I can fix; in that case I recommend splitting your variants into smaller files so that the GPU is reset periodically (see the sketch below this list).
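
For point 3, a rough sketch of both the monitoring and the splitting, assuming an uncompressed VCF, standard coreutils, and nvidia-smi on the path; the input name and chunk size are placeholders:

```bash
# Watch VRAM over time (prints GPU memory use every 60 s); if the numbers
# climb steadily over the run, that points to the memory-leak case.
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 60

# Split a plain-text VCF into ~500k-record chunks, repeating the header in
# each chunk, so every chunk can be annotated by a separate cis-vcf run.
grep '^#'    input.vcf > header.txt
grep -v '^#' input.vcf | split -l 500000 - body_
for part in body_*; do
  cat header.txt "$part" > "chunk_${part#body_}.vcf"
  rm "$part"
done
```

Running cis-vcf once per chunk, as separate invocations, means the process (and with it the GPU memory) is released between chunks.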

Let's hope that decreasing the batch size already helps; do let me know your progress!
