The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges affecting AMD GPUs in cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.
- GPU Telemetry
- GPU Statistics for Jobs
- Integration with Third-Party Tools
- Open Source
For comprehensive documentation and to get started with RDC using pre-built packages, refer to the ROCm Data Center Tool User Guide.
Before setting up RDC, ensure your system meets the following requirements:
- Supported Platforms: RDC runs on AMD ROCm-supported platforms. Refer to the List of Supported Operating Systems for details.
- Dependencies: see the build-from-source instructions below for the required tools and libraries.
For certificate generation, refer to the RDC Developer Handbook (Generate Files for Authentication) or consult the concise guide located at authentication/readme.txt.
RDC supports two primary modes of operation: Standalone and Embedded. Choose the mode that best fits your deployment needs.
Standalone mode allows RDC to run independently with all its components installed.
- Start RDCD with Authentication (Monitor-Only Capabilities):
  /opt/rocm/bin/rdcd
- Start RDCD with Authentication (Full Capabilities):
  sudo /opt/rocm/bin/rdcd
- Start RDCD without Authentication (Monitor-Only Capabilities):
  /opt/rocm/bin/rdcd -u
- Start RDCD without Authentication (Full Capabilities):
  sudo /opt/rocm/bin/rdcd -u
A quick reachability check follows below.
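Before wiring tools against a standalone daemon, it can help to confirm that RDCD is actually accepting connections. The following is a minimal Python sketch, assuming RDCD listens on port 50051 (the port used in the configuration example later in this section); it only tests TCP reachability, not authentication.

import socket

# Hypothetical helper for this guide: reports whether anything is
# listening on the RDCD port. Checks TCP reachability only, not SSL.
def rdcd_reachable(host="localhost", port=50051, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    state = "reachable" if rdcd_reachable() else "not reachable"
    print(f"rdcd is {state} on port 50051")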
Embedded mode integrates RDC directly into your existing management tools using its library format.
- Run RDC in Embedded Mode:
  python your_management_tool.py --rdc_embedded
Note: Ensure that the rdcd daemon is not running separately when using embedded mode. A sketch of the embedded pattern follows.
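As a rough illustration of what embedded mode looks like inside a management tool, the pattern below mirrors the RdcReader telemetry sample shown later in this guide: passing ip_port=None makes RdcReader load the RDC library in-process instead of connecting to a remote RDCD. This is a minimal sketch that assumes the RDC Python bindings (RdcReader.py and rdc_bootstrap) are on PYTHONPATH.

from RdcReader import RdcReader
from rdc_bootstrap import *
import time

class EmbeddedReader(RdcReader):
    def __init__(self):
        # ip_port=None embeds the RDC library in this process,
        # so no separate rdcd daemon may be running.
        super().__init__(ip_port=None,
                         field_ids=[rdc_field_t.RDC_FI_GPU_UTIL],
                         update_freq=1000000)

    def handle_field(self, gpu_index, value):
        print(f"GPU {gpu_index} utilization: {value.value.l_int}%")

if __name__ == "__main__":
    reader = EmbeddedReader()
    while True:
        time.sleep(1)
        reader.process()  # polls fields and invokes handle_field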
- Copy the Service File:
  sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
- Configure Capabilities:
  - Full Capabilities: Ensure the following lines are uncommented in /etc/systemd/system/rdc.service:
    CapabilityBoundingSet=CAP_DAC_OVERRIDE
    AmbientCapabilities=CAP_DAC_OVERRIDE
  - Monitor-Only Capabilities: Comment out the lines above to restrict RDCD to monitoring.
- Start the Service:
  sudo systemctl start rdc
  sudo systemctl status rdc
- Modify RDCD Options:
  Edit /opt/rocm/share/rdc/conf/rdc_options.conf to append any additional RDCD parameters:
  sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
  Example Configuration:
  RDC_OPTS="-p 50051 -u -d"
  Flags:
  - -p 50051: Use port 50051
  - -u: Unauthenticated mode
  - -d: Enable debug messages
If you prefer to build RDC from source, follow the steps below.
Important: RDC requires gRPC and protoc to be built from source, as pre-built packages are not available.
- Install Required Tools:
  sudo apt-get update
  sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
- Clone and Build gRPC:
  git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
  cd grpc
  export GRPC_ROOT=/opt/grpc
  cmake -B build \
      -DgRPC_INSTALL=ON \
      -DgRPC_BUILD_TESTS=OFF \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,'$ORIGIN' \
      -DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DCMAKE_BUILD_TYPE=Release
  make -C build -j $(nproc)
  sudo make -C build install
  echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
  sudo ldconfig
  cd ..
- Clone the RDC Repository:
  git clone https://github.com/ROCm/rdc
  cd rdc
- Configure the Build:
  cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
  Optional Features:
  - Enable ROCm Profiler:
    cmake -B build -DBUILD_PROFILER=ON
  - Enable RVS:
    cmake -B build -DBUILD_RVS=ON
  - Build RDC Library Only (without rdci and rdcd):
    cmake -B build -DBUILD_STANDALONE=OFF
  - Build RDC Library Without ROCm Run-time:
    cmake -B build -DBUILD_RUNTIME=OFF
- Build and Install:
  make -C build -j $(nproc)
  sudo make -C build install
- Update System Library Path:
  export RDC_LIB_DIR=/opt/rocm/lib/rdc
  export GRPC_LIB_DIR="/opt/grpc/lib"
  echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
  echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
  sudo ldconfig
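To sanity-check the linker configuration, you can try loading one of the RDC shared libraries by name. A minimal sketch; librdc_bootstrap.so is an assumed library name based on the librdc_*.so pattern mentioned in the troubleshooting notes below, so substitute whichever librdc_*.so your build installed.

import ctypes

# If this load fails, /etc/ld.so.conf.d was likely not updated or
# ldconfig was not re-run after installation.
try:
    ctypes.CDLL("librdc_bootstrap.so")  # assumed name; any librdc_*.so works here
    print("RDC shared library resolved by the dynamic linker")
except OSError as err:
    print(f"RDC shared library not found: {err}")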
Locate and display information about GPUs present in a compute node.
Example:
rdci discovery <host_name> -l
Output:
2 GPUs found
+-----------+----------------------------------------------+
| GPU Index | Device Information |
+-----------+----------------------------------------------+
| 0 | Name: AMD Radeon Instinct MI50 Accelerator |
| 1 | Name: AMD Radeon Instinct MI50 Accelerator |
+-----------+----------------------------------------------+
Create, delete, and list logical groups of GPUs.
Create a Group:
rdci group -c GPU_GROUP
Add GPUs to Group:
rdci group -g 1 -a 0,1
List Groups:
rdci group -l
Delete a Group:
rdci group -d 1
Manage field groups to monitor specific GPU metrics.
Create a Field Group:
rdci fieldgroup -c <fgroup> -f 150,155
List Field Groups:
rdci fieldgroup -l
Delete a Field Group:
rdci fieldgroup -d 1
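In practice, the group and field-group commands are combined into one workflow: create a GPU group, populate it, create a field group, then watch it with dmon (covered in a later section). The sketch below chains the rdci invocations shown above through Python's subprocess module; the group and field-group IDs are illustrative, since rdci assigns the real IDs at creation time.

import subprocess

def rdci(*args):
    # Run an rdci command and echo its output; raises on failure.
    result = subprocess.run(["rdci", *args], capture_output=True,
                            text=True, check=True)
    print(result.stdout.strip())

rdci("group", "-c", "GPU_GROUP")       # create a GPU group
rdci("group", "-g", "1", "-a", "0,1")  # add GPUs 0 and 1 (group ID illustrative)
rdci("fieldgroup", "-c", "my_fields", "-f", "150,155")  # field IDs from the example above
rdci("dmon", "-f", "1", "-g", "1", "-c", "5", "-d", "1000")  # sample 5 times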
Define fields to monitor RAS ECC counters:
- Correctable ECC Errors: 312 RDC_FI_ECC_CORRECT_TOTAL
- Uncorrectable ECC Errors: 313 RDC_FI_ECC_UNCORRECT_TOTAL
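These counters can also be read through the Python bindings. The sketch below is modeled directly on the RdcReader telemetry sample later in this guide and assumes rdc_bootstrap exposes the two field names listed above.

from RdcReader import RdcReader
from rdc_bootstrap import *
import time

# RAS ECC counters (field IDs 312 and 313 above); enum names assumed
# to match the listing.
ecc_field_ids = [
    rdc_field_t.RDC_FI_ECC_CORRECT_TOTAL,
    rdc_field_t.RDC_FI_ECC_UNCORRECT_TOTAL,
]

class EccReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=ecc_field_ids,
                         update_freq=1000000)

    def handle_field(self, gpu_index, value):
        name = self.rdc_util.field_id_string(value.field_id)
        print(f"GPU {gpu_index} {name}: {value.value.l_int}")

if __name__ == "__main__":
    reader = EccReader()
    while True:
        time.sleep(1)
        reader.process()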
Monitor GPU fields such as temperature, power usage, and utilization.
Command:
rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
Sample Output:
1 group found
+-----------+-------------+---------------+
| GPU Index | TEMP (m°C)  | POWER (µW)    |
+-----------+-------------+---------------+
| 0 | 25000 | 520500 |
+-----------+-------------+---------------+
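Note the units: temperature is reported in millidegrees Celsius and power in microwatts, so the row above corresponds to 25 °C and roughly 0.52 W. A small conversion helper:

# Convert dmon's raw units into conventional ones.
def millidegrees_to_celsius(value: int) -> float:
    return value / 1000.0

def microwatts_to_watts(value: int) -> float:
    return value / 1_000_000.0

print(millidegrees_to_celsius(25000))  # 25.0 °C
print(microwatts_to_watts(520500))     # 0.5205 W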
Display GPU statistics for any given workload.
Start Recording Stats:
rdci stats -s 2 -g 1
Stop Recording Stats:
rdci stats -x 2
Display Job Stats:
rdci stats -j 2
Sample Output:
Summary:
Executive Status:
Start time: 1586795401
End time: 1586795445
Total execution time: 44
Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
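The stats commands are typically wrapped around the workload itself: start recording, run the job, stop recording, then display the summary. A minimal sketch using Python's subprocess module with the job ID (2) and GPU group (1) from the examples above; the sleep stands in for a real compute job.

import subprocess

JOB_ID = "2"     # job tag used in the examples above
GPU_GROUP = "1"  # GPU group created earlier

# Start recording, run the workload, and always stop recording.
subprocess.run(["rdci", "stats", "-s", JOB_ID, "-g", GPU_GROUP], check=True)
try:
    subprocess.run(["sleep", "30"], check=True)  # placeholder workload
finally:
    subprocess.run(["rdci", "stats", "-x", JOB_ID], check=True)

# Display the recorded job statistics.
subprocess.run(["rdci", "stats", "-j", JOB_ID], check=True)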
Run diagnostics on a GPU group to ensure system health.
Command:
rdci diag -g <gpu_group>
Sample Output:
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
GPU parameters check: GPU 0 Critical Edge temperature in range.
Compute Queue ready: Run binary search task on GPU 0 Pass.
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
RDC integrates seamlessly with tools like Prometheus, Grafana, and Reliability, Availability, and Serviceability (RAS) to enhance monitoring and visualization.
RDC provides a generic Python class RdcReader to simplify telemetry gathering.
Sample Program:
from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

# Fields to watch: power usage and GPU utilization.
default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]

class SimpleRdcReader(RdcReader):
    def __init__(self):
        # ip_port=None runs RDC in embedded mode (no separate rdcd daemon).
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        # Called once per GPU per field on every process() pass.
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")

if __name__ == '__main__':
    reader = SimpleRdcReader()
    while True:
        time.sleep(1)
        reader.process()
Running the Example:
# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py
The Prometheus plugin allows you to monitor events and send alerts.
Installation:
- Install the Prometheus Client:
  pip install prometheus_client
- Run the Prometheus Plugin:
  python rdc_prometheus.py
- Verify the Plugin:
  curl localhost:5000
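The same check can be done programmatically. A minimal sketch using only the Python standard library; it assumes the plugin is serving on localhost:5000 as above and prints the metric lines from the Prometheus text output.

from urllib.request import urlopen

# Fetch the exporter's text output and show non-comment metric lines.
with urlopen("http://localhost:5000") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line and not line.startswith("#"):
        print(line)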
Integration Steps:
- Download and Install Prometheus.
- Configure Prometheus Targets:
  Modify prometheus_targets.json to point to your compute nodes:
  [
    {
      "targets": [
        "rdc_test1.amd.com:5000",
        "rdc_test2.amd.com:5000"
      ]
    }
  ]
- Start Prometheus with the Configuration File:
  prometheus --config.file=/path/to/rdc_prometheus_example.yml
- Access the Prometheus UI:
  Open http://localhost:9090 in your browser.
Grafana provides advanced visualization capabilities for RDC metrics.
Installation:
- Download Grafana.
- Install Grafana by following the Installation Instructions.
- Start the Grafana Server:
  sudo systemctl start grafana-server
  sudo systemctl status grafana-server
- Access Grafana:
  Open http://localhost:3000 in your browser and log in with the default credentials (admin/admin).
Configuration Steps:
- Add Prometheus Data Source:
  - Navigate to Configuration → Data Sources → Add data source → Prometheus.
  - Set the URL to http://localhost:9090 and save.
- Import the RDC Dashboard:
  - Click the + icon and select Import.
  - Upload rdc_grafana_dashboard_example.json from the python_binding folder.
  - Select the desired compute node for visualization.
The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
Installation:
- Ensure the GPU Supports RAS:
  The GPU must support RAS features.
- RDC Installation Includes the RAS Library:
  librdc_ras.so is located in /opt/rocm-4.2.0/rdc/lib.
Usage:
- Monitor ECC Errors:
  rdci dmon -i 0 -e 600,601
Sample Output:
GPU   ECC_CORRECT   ECC_UNCORRECT
0     0             0
If metrics are missing or report as unsupported, check the following:
- Missing Libraries:
  - Verify that /opt/rocm/lib/rdc/librdc_*.so exists.
  - Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
- Unsupported GPU:
  - Most metrics work on MI300 and newer.
  - Limited metrics are available on MI200.
  - Consumer GPUs (e.g., RX6800) have fewer supported metrics.
Solution:
Set the HSA_TOOLS_LIB environment variable before running a compute job:
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
Example:
# Terminal 1
rdcd -u
# Terminal 2
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
gpu-burn
# Terminal 3
rdci dmon -u -e 800,801 -i 0 -c 1
# Output:
GPU OCCUPANCY_PERCENT ACTIVE_WAVES
0 001.000 32640.000
Error Message:
terminate called after throwing an instance of 'std::runtime_error'
what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Aborted (core dumped)
Solution:
- Missing Groups:
  Ensure the video and render groups exist, and add your user to them:
  sudo usermod -aG video,render $USER
  Log out and log back in to apply the group changes.
- View RDCD Logs:
  sudo journalctl -u rdc
- Run RDCD with Debug Logs:
  RDC_LOG=DEBUG /opt/rocm/bin/rdcd
  Logging levels supported: ERROR, INFO, DEBUG
- Enable Additional Logging Messages:
  export RSMI_LOGGING=3
RDC is open-source and available under the MIT License.
For support and further inquiries, please refer to the ROCm Documentation or contact the maintainers through the repository's issue tracker.