[Issue]: Unable to map memory regions to virtual address space #287

Open
daniandtheweb opened this issue Jan 25, 2025 · 13 comments

@daniandtheweb

daniandtheweb commented Jan 25, 2025

Problem Description

When trying to map a memory region into a virtual address space via hipMemMap, ROCr reports that the GPU is out of memory and the program is unable to continue.

This issue can be reproduced using the same reproducer code that's been included in #285.

#include <hip/hip_runtime.h>
#include <stdio.h>

#define HIP_CHECK(fn) { hipError_t err = fn; if(err != hipSuccess){fprintf(stderr, "Error: %s: %s at %d\n", hipGetErrorName(err), hipGetErrorString(err), __LINE__);} }

int main()
{
	size_t granularity;
	hipMemAllocationProp alloc_prop = {};
	alloc_prop.type = hipMemAllocationTypePinned;
	alloc_prop.location.type = hipMemLocationTypeDevice;
	alloc_prop.location.id = 0;
	HIP_CHECK(hipMemGetAllocationGranularity(&granularity, &alloc_prop, hipMemAllocationGranularityRecommended));
	printf("Device recommended granularity %zu\n", granularity);
	
	constexpr size_t maxSize = 1ull << 35; // 32 GB
	hipDeviceptr_t pool_addr = 0;
	HIP_CHECK(hipMemAddressReserve(&pool_addr, maxSize, 0, 0, 0));
	printf("reserved pool at %p\n", pool_addr);
	
	hipMemAllocationProp prop = {};
	prop.type = hipMemAllocationTypePinned;
	prop.location.type = hipMemLocationTypeDevice;
	prop.location.id = 0;
	hipMemGenericAllocationHandle_t handle;
	
	size_t pool_size = 0;
	HIP_CHECK(hipMemCreate(&handle, granularity, &prop, 0));
	HIP_CHECK(hipMemMap(static_cast<char*>(pool_addr) + pool_size, granularity, 0, handle, 0));
	HIP_CHECK(hipMemRelease(handle));
	pool_size += granularity;
	
	HIP_CHECK(hipMemCreate(&handle, granularity, &prop, 0));
	HIP_CHECK(hipMemMap(static_cast<char*>(pool_addr) + pool_size, granularity, 0, handle, 0));
	HIP_CHECK(hipMemRelease(handle));
	pool_size += granularity;
	
	HIP_CHECK(hipMemCreate(&handle, granularity, &prop, 0));
	HIP_CHECK(hipMemMap(static_cast<char*>(pool_addr) + pool_size, granularity, 0, handle, 0));
	HIP_CHECK(hipMemRelease(handle));
	pool_size += granularity;
	
	printf("unmapping %zu at %p\n", pool_size, pool_addr);
	HIP_CHECK(hipMemUnmap(pool_addr, pool_size));
	printf("Freeing virtual space %zu at %p\n", maxSize, pool_addr);
	HIP_CHECK(hipMemAddressFree(pool_addr, maxSize));
	
	return 0;
}

Here's the output:

Device recommended granularity 4096
Error: hipErrorOutOfMemory: out of memory at 18
reserved pool at (nil)
Error: hipErrorInvalidValue: invalid argument at 29
unmapping 12288 at (nil)
Error: hipErrorInvalidValue: invalid argument at 44
Freeing virtual space 34359738368 at (nil)
Error: hipErrorInvalidValue: invalid argument at 46
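
The cascade in this output follows from how HIP_CHECK is written: it prints the error and keeps going. Line 18 of the reproducer is the hipMemAddressReserve call, so once it fails pool_addr stays null, and the later calls that report errors at lines 29, 44 and 46 (hipMemMap, hipMemUnmap, hipMemAddressFree) are working with an invalid address. A minimal sketch of a fail-fast variant is shown below. It is only an illustration: `Status`, `status_name`, `CHECK_OR_RETURN` and `run_sequence` are hypothetical stand-ins so the block compiles without the HIP headers. In the real reproducer the same pattern would presumably wrap hipError_t.

```cpp
#include <cstdio>

// Hypothetical stand-in for hipError_t, so this sketch builds without HIP.
enum class Status { Success = 0, OutOfMemory, InvalidValue };

inline const char* status_name(Status s) {
    switch (s) {
        case Status::Success:      return "Success";
        case Status::OutOfMemory:  return "OutOfMemory";
        case Status::InvalidValue: return "InvalidValue";
    }
    return "Unknown";
}

// Unlike the reproducer's HIP_CHECK, this returns from the enclosing
// function on the first failure instead of printing and continuing.
#define CHECK_OR_RETURN(expr)                                        \
    do {                                                             \
        Status status_ = (expr);                                     \
        if (status_ != Status::Success) {                            \
            std::fprintf(stderr, "Error: %s at %d\n",                \
                         status_name(status_), __LINE__);            \
            return status_;                                          \
        }                                                            \
    } while (0)

// Hypothetical three-step sequence (reserve, map, unmap). calls_made
// records how many steps were attempted before returning.
inline Status run_sequence(bool reserve_succeeds, int& calls_made) {
    calls_made = 0;

    ++calls_made;  // stands in for the address reservation
    CHECK_OR_RETURN(reserve_succeeds ? Status::Success : Status::OutOfMemory);

    ++calls_made;  // stands in for mapping into the reserved range
    CHECK_OR_RETURN(Status::Success);

    ++calls_made;  // stands in for unmapping
    CHECK_OR_RETURN(Status::Success);

    return Status::Success;
}
```

With this shape, a failed reservation stops the sequence after one step, so only the first error is reported and the follow-on invalid-argument errors do not appear.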

Operating System

Arch Linux, Mainline Kernel

CPU

Intel(R) Core(TM) i7-9700K

GPU

AMD Radeon RX 5700 XT

ROCm Version

ROCm 6.3.0, ROCm 6.1.0

ROCm Component

ROCR-Runtime, clr

Steps to Reproduce

Run the reproducer

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3601                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32777644(0x1f425ac) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    32777644(0x1f425ac) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32777644(0x1f425ac) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32777644(0x1f425ac) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5700 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 29471(0x731f)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   1024                               
  Internal Node ID:        1                                  
  Compute Unit:            40                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 151                                
  SDMA engine uCode::      35                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***      

Additional Information

This issue has been reported here: ggerganov/llama.cpp#11405

@MangoTCF

MangoTCF commented Jan 25, 2025

Experiencing the same.

Operating System

Arch Linux 6.13.0-arch1-1

CPU

AMD Ryzen 7 7840HS w/ Radeon 780M Graphics

GPU

RX 7700S, Radeon 780M

ROCm Version

6.2.41134-0

Output of /opt/rocm/bin/rocminfo --support

/opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3801                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    28535168(0x1b36980) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    28535168(0x1b36980) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    28535168(0x1b36980) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1102                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 7700S                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 29824(0x7480)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2208                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            32                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 462                                
  SDMA engine uCode::      21                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1102         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1103                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon 780M                    
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5567(0x15bf)                       
  ASIC Revision:           9(0x9)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2700                               
  BDFID:                   50176                              
  Internal Node ID:        2                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 40                                 
  SDMA engine uCode::      21                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    14267584(0xd9b4c0) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    14267584(0xd9b4c0) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1103         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

@ppanchad-amd

Hi @daniandtheweb. An internal ticket has been created to investigate this issue. Thanks!

@tcgu-amd

tcgu-amd commented Jan 27, 2025

Hi @daniandtheweb, thanks for reporting the issue. As the error log hints, it is likely caused by the system running out of memory: the script provided is trying to allocate 32 GB of memory. Would you be able to try changing the following line (L16) to

//constexpr size_t maxSize = 1ull << 35; // 32 GB
constexpr size_t maxSize = 1ull << 32; // 4 GB

and see if the issue persists?

Thanks!
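
For reference, the sizes involved can be checked with a few lines of arithmetic. The sketch below shows that `1ull << 35` is 32 GiB while `1ull << 32` is 4 GiB, and compares both against the 8372224 KB pool size that rocminfo reports for the RX 5700 XT. The helper names (`bytes_from_shift`, `whole_gib`, `fits_in_pool`) are hypothetical, and whether the runtime really rejects a reservation because it exceeds the pool size is an assumption based on the suggestion above, not something confirmed in this thread.

```cpp
#include <cstdint>

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kGiB = 1024ull * 1024ull * 1024ull;

// Size in bytes of a request written as 1ull << shift.
constexpr std::uint64_t bytes_from_shift(unsigned shift) {
    return 1ull << shift;
}

// Number of whole GiB in a byte count.
constexpr std::uint64_t whole_gib(std::uint64_t bytes) {
    return bytes / kGiB;
}

// True if the request is no larger than a pool whose size is given in KiB,
// the unit rocminfo uses for "Size".
constexpr bool fits_in_pool(std::uint64_t request_bytes, std::uint64_t pool_kib) {
    return request_bytes <= pool_kib * kKiB;
}

// Pool 1 size of the gfx1010 agent, taken from the rocminfo output above.
constexpr std::uint64_t kPoolKiB = 8372224;

static_assert(bytes_from_shift(35) == 34359738368ull, "matches the size printed by the reproducer");
static_assert(whole_gib(bytes_from_shift(35)) == 32, "1ull << 35 is 32 GiB");
static_assert(whole_gib(bytes_from_shift(32)) == 4, "1ull << 32 is 4 GiB");
static_assert(!fits_in_pool(bytes_from_shift(35), kPoolKiB), "32 GiB exceeds the roughly 8 GiB pool");
static_assert(fits_in_pool(bytes_from_shift(32), kPoolKiB), "4 GiB fits in the roughly 8 GiB pool");
```

The checks run at compile time, so the block builds only if the numbers hold.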

@daniandtheweb

daniandtheweb commented Jan 27, 2025

Changing the line to 32 fixes the issue, thanks. However, the program still doesn't complete correctly.
In the last portion of the program, when it's supposed to free the virtual space, it instead crashes with a segfault.

./reproducer                                                                                                             4.396s 
Device recommended granularity 4096
reserved pool at 0x747203e00000
unmapping 12288 at 0x747203e00000
zsh: segmentation fault (core dumped)  ./reproducer

Here's the systemd-coredump:

Process 9897 (reproducer) of user 1000 dumped core.
                                                  
                                                  Stack trace of thread 9897:
                                                  #0  0x000074741f346d78 n/a (libamdhip64.so.6 + 0x346d78)
                                                  #1  0x00005fbde60dd39e n/a (n/a + 0x0)
                                                  #2  0x000074741ea34ecc __libc_start_main (libc.so.6 + 0x25ecc)
                                                  #3  0x00005fbde60dd125 n/a (n/a + 0x0)
                                                  ELF object binary architecture: AMD x86-64

Using the second reproducer file that's mentioned in the other issue, the program doesn't even manage to unmap the memory.

#include <hip/hip_runtime.h>
#include <stdio.h>

#define HIP_CHECK(fn) { hipError_t err = fn; if(err != hipSuccess){fprintf(stderr, "Error: %s: %s at %d\n", hipGetErrorName(err), hipGetErrorString(err), __LINE__);} }

int main()
{
	size_t granularity;
	hipMemAllocationProp alloc_prop = {};
	alloc_prop.type = hipMemAllocationTypePinned;
	alloc_prop.location.type = hipMemLocationTypeDevice;
	alloc_prop.location.id = 0;
	HIP_CHECK(hipMemGetAllocationGranularity(&granularity, &alloc_prop, hipMemAllocationGranularityRecommended));
	printf("Device recommended granularity %zu\n", granularity);
	
	constexpr size_t maxSize = 1ull << 32; // 4 GB
	hipDeviceptr_t pool_addr = 0;
	HIP_CHECK(hipMemAddressReserve(&pool_addr, maxSize, 0, 0, 0));
	printf("reserved pool at %p\n", pool_addr);
	
	hipMemAllocationProp prop = {};
	prop.type = hipMemAllocationTypePinned;
	prop.location.type = hipMemLocationTypeDevice;
	prop.location.id = 0;
	hipMemGenericAllocationHandle_t handle;
	
	size_t pool_size = 0;
	HIP_CHECK(hipMemCreate(&handle, granularity, &prop, 0));
	HIP_CHECK(hipMemMap(static_cast<char*>(pool_addr) + pool_size, granularity, 0, handle, 0));
	HIP_CHECK(hipMemUnmap(static_cast<char*>(pool_addr) + pool_size, granularity));
	HIP_CHECK(hipMemRelease(handle));
	pool_size += granularity;
	
	HIP_CHECK(hipMemCreate(&handle, granularity, &prop, 0));
	HIP_CHECK(hipMemMap(static_cast<char*>(pool_addr) + pool_size, granularity, 0, handle, 0));
	HIP_CHECK(hipMemUnmap(static_cast<char*>(pool_addr) + pool_size, granularity));
	HIP_CHECK(hipMemRelease(handle));
	pool_size += granularity;
	
	HIP_CHECK(hipMemCreate(&handle, granularity, &prop, 0));
	HIP_CHECK(hipMemMap(static_cast<char*>(pool_addr) + pool_size, granularity, 0, handle, 0));
	HIP_CHECK(hipMemUnmap(static_cast<char*>(pool_addr) + pool_size, granularity));
	HIP_CHECK(hipMemRelease(handle));
	pool_size += granularity;
	
	printf("unmapping %zu at %p\n", pool_size, pool_addr);
	//HIP_CHECK(hipMemUnmap(pool_addr, pool_size));
	printf("Freeing virtual space %zu at %p\n", maxSize, pool_addr);
	HIP_CHECK(hipMemAddressFree(pool_addr, maxSize));
	
	return 0;
}
./reproducer1                                                                                                            4.354s 
Device recommended granularity 4096
reserved pool at 0x780277e00000
zsh: segmentation fault (core dumped)  ./reproducer1

If this is unrelated to rocr I can close the issue.

@tcgu-amd

Hi @daniandtheweb, thanks for the update! I am not quite sure of the cause of your second error. If I had to guess, it is probably an incompatibility between gfx1010 and clr. Unfortunately, we don't have a system at hand where I can reproduce your issue. I only managed to get this

Image

on a system with gfx1100 and the latest ROCm 6.3.1, where everything seems to work.

@daniandtheweb
Author

My main system currently runs ROCm 6.1.2. I'll update it to a more recent version and test again. Thanks for the help.

@IMbackK

IMbackK commented Jan 28, 2025

Hi @daniandtheweb, thanks for reporting the issue. As the error log hints, it is likely caused by the system running out of memory -- the script provided is trying to allocate 32 GB of memory. Would you be able to try changing the following line (L16) to

//constexpr size_t maxSize = 1ull << 35; // 32 GB
constexpr size_t maxSize = 1ull << 32; // 4 GB

and see if the issue persists?

Thanks!

The code is trying to allocate 32 GB of virtual address space, not memory, so the amount of physical memory on the card should be immaterial. Indeed, the same code on CUDA platforms allows one to reserve 32 GB of address space even when the card has only 4 GB or less of physical memory.
Is there guidance on how large the address space of various AMD cards is?

@IMbackK

IMbackK commented Jan 28, 2025

@tcgu-amd

I can also confirm that the reproducer with 32 GB works fine for me on an RX 6800 XT (which of course also doesn't have 32 GB of physical memory) and an MI100 (which does).

Other users of ours report that the reproducer does not work on the RX 6700 XT or the RX 7900 XTX. The commonality seems to be that the users with failing tests are on consumer platforms (AM4/AM5/LGA 1700), while I am on a server platform (EPYC Rome).

@tcgu-amd

@IMbackK Thanks for the additional info! This is actually a great point.

Running the reproducer with HSAKMT_DEBUG_LEVEL=3 shows

 [hsaKmtAllocMemoryAlign] failed to allocate 34359738368 bytes from host

which indicates that it is, in fact, the host running out of memory, not the device. This could be the reason why the code works on server platforms but not on consumer platforms.

The message can be traced down to this line of code https://github.com/ROCm/ROCR-Runtime/blob/amd-staging/libhsakmt/src/memory.c#L188.

I am not quite sure if this is the intended behavior, but I will try to look into it. Thanks!

@daniandtheweb
Author

Regarding the other issue I was having, I can confirm it was caused by the old ROCm version; using ROCm 6.2.2, the reproducer works correctly.

./reproducer                                                                                                             0.023s 
Device recommended granularity 4096
reserved pool at 0x76fbcbe00000
unmapping 12288 at 0x76fbcbe00000
Freeing virtual space 4294967296 at 0x76fbcbe00000

@tcgu-amd

Regarding the other issue I was having, I can confirm it was caused by the old ROCm version; using ROCm 6.2.2, the reproducer works correctly.

./reproducer                                                                                                             0.023s 
Device recommended granularity 4096
reserved pool at 0x76fbcbe00000
unmapping 12288 at 0x76fbcbe00000
Freeing virtual space 4294967296 at 0x76fbcbe00000

@daniandtheweb Thanks for the update. Glad it is working!

@IMbackK

IMbackK commented Jan 30, 2025

which indicates that it is, in fact, the host running out of memory, not the device. This could be the reason why the code works on server platforms but not on consumer platforms.

The message can be traced down to this line of code https://github.com/ROCm/ROCR-Runtime/blob/amd-staging/libhsakmt/src/memory.c#L188.

I am not quite sure if this is the intended behavior, but I will try to look into it. Thanks!

ROCR allocating a bunch of RAM via malloc on the host when virtual address space is requested is quite strange, and it makes it impossible to use VMM to eventually fill the device on machines with more VRAM than RAM.

@tcgu-amd

tcgu-amd commented Jan 30, 2025

ROCR allocating a bunch of RAM via malloc on the host when virtual address space is requested is quite strange, and it makes it impossible to use VMM to eventually fill the device on machines with more VRAM than RAM.

@IMbackK This is definitely a valid point. We are currently investigating the cause of this behavior. I will keep you posted on the progress. Thanks!
