
ROCm™ Data Center Tool (RDC) 🚀

The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges for AMD GPUs in cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.

🌟 Main Features

  • GPU Telemetry 📊
  • GPU Statistics for Jobs 📈
  • Integration with Third-Party Tools 🔗
  • Open Source 🛠️

For comprehensive documentation and to get started with RDC using pre-built packages, refer to the ROCm Data Center Tool User Guide.


πŸ› οΈ Installation Guide

πŸ“‹ Prerequisites

Before setting up RDC, ensure your system meets the following requirements:

  • Supported Platforms: RDC runs on AMD ROCm-supported platforms. Refer to the List of Supported Operating Systems for details.
  • Dependencies:
    • CMake β‰₯ 3.15
    • g++ (5.4.0)
    • Doxygen (1.8.11)
    • LaTeX (pdfTeX 3.14159265-2.6-1.40.16)
    • gRPC and protoc
    • libcap-dev
    • AMD ROCm Platform (GitHub)

πŸ” Certificate Generation

For certificate generation, refer to the RDC Developer Handbook (Generate Files for Authentication) or consult the concise guide located at authentication/readme.txt.


🚀 Running RDC

RDC supports two primary modes of operation: Standalone and Embedded. Choose the mode that best fits your deployment needs.

πŸ—‚οΈ Standalone Mode

Standalone mode allows RDC to run independently with all its components installed.

  1. Start RDCD with Authentication (Monitor-Only Capabilities):

    /opt/rocm/bin/rdcd
  2. Start RDCD with Authentication (Full Capabilities):

    sudo /opt/rocm/bin/rdcd
  3. Start RDCD without Authentication (Monitor-Only):

    /opt/rocm/bin/rdcd -u
  4. Start RDCD without Authentication (Full Capabilities):

    sudo /opt/rocm/bin/rdcd -u

🔗 Embedded Mode

Embedded mode integrates RDC directly into your existing management tools as a shared library.

  • Run RDC in Embedded Mode:

    python your_management_tool.py --rdc_embedded

Note: Ensure that the rdcd daemon is not running separately when using embedded mode.

πŸ› οΈ Starting RDCD Using systemd

  1. Copy the Service File:

    sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
  2. Configure Capabilities:

    • Full Capabilities: Ensure the following lines are uncommented in /etc/systemd/system/rdc.service:

      CapabilityBoundingSet=CAP_DAC_OVERRIDE
      AmbientCapabilities=CAP_DAC_OVERRIDE
    • Monitor-Only Capabilities: Comment out the above lines to restrict RDCD to monitoring.

  3. Start the Service:

    sudo systemctl start rdc
    sudo systemctl status rdc
  4. Modify RDCD Options:

    Edit /opt/rocm/share/rdc/conf/rdc_options.conf to append any additional RDCD parameters.

    sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf

    Example Configuration:

    RDC_OPTS="-p 50051 -u -d"
    • Flags:
      • -p 50051 : Use port 50051
      • -u : Unauthenticated mode
      • -d : Enable debug messages

πŸ—οΈ Building RDC from Source

If you prefer to build RDC from source, follow the steps below.

🔧 Building gRPC and protoc

Important: RDC requires gRPC and protoc to be built from source, as pre-built packages are not available.

  1. Install Required Tools:

    sudo apt-get update
    sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
  2. Clone and Build gRPC:

    git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
    cd grpc
    export GRPC_ROOT=/opt/grpc
    cmake -B build \
        -DgRPC_INSTALL=ON \
        -DgRPC_BUILD_TESTS=OFF \
        -DBUILD_SHARED_LIBS=ON \
        -DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,'$ORIGIN' \
        -DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_BUILD_TYPE=Release
    make -C build -j $(nproc)
    sudo make -C build install
    echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
    sudo ldconfig
    cd ..

🔧 Building RDC

  1. Clone the RDC Repository:

    git clone https://github.com/ROCm/rdc
    cd rdc
  2. Configure the Build:

    cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
    • Optional Features:
      • Enable ROCm Profiler:

        cmake -B build -DBUILD_PROFILER=ON
      • Enable RVS:

        cmake -B build -DBUILD_RVS=ON
      • Build RDC Library Only (without rdci and rdcd):

        cmake -B build -DBUILD_STANDALONE=OFF
      • Build RDC Library Without ROCm Run-time:

        cmake -B build -DBUILD_RUNTIME=OFF
  3. Build and Install:

    make -C build -j $(nproc)
    sudo make -C build install
  4. Update System Library Path:

    export RDC_LIB_DIR=/opt/rocm/lib/rdc
    export GRPC_LIB_DIR="/opt/grpc/lib"
    echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
    echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
    sudo ldconfig
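To confirm the loader can now resolve the client library, the lookup can be scripted. This is a minimal sketch using only the Python standard library; the library name rdc_client is inferred from the librdc_client file name above and is an assumption, not an official API:

```python
# Sanity-check that the dynamic loader can resolve shared libraries
# after the ldconfig steps above. find_library returns the soname if
# the library is found, else None. The name "rdc_client" is inferred
# from the librdc_client file name and is an assumption.
from ctypes.util import find_library

def check_libs(names=("rdc_client",)):
    """Return {library name: soname or None} for each requested name."""
    return {name: find_library(name) for name in names}

if __name__ == "__main__":
    for name, soname in check_libs().items():
        print(name, ":", soname if soname else "NOT FOUND (run: sudo ldconfig)")
```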

📊 Features Overview

🔍 Discovery

Locate and display information about GPUs present in a compute node.

Example:

rdci discovery <host_name> -l

Output:

2 GPUs found

+-----------+----------------------------------------------+
| GPU Index | Device Information                           |
+-----------+----------------------------------------------+
| 0         | Name: AMD Radeon Instinct MI50 Accelerator   |
| 1         | Name: AMD Radeon Instinct MI50 Accelerator   |
+-----------+----------------------------------------------+

👥 Groups

🖥️ GPU Groups

Create, delete, and list logical groups of GPUs.

Create a Group:

rdci group -c GPU_GROUP

Add GPUs to Group:

rdci group -g 1 -a 0,1

List Groups:

rdci group -l

Delete a Group:

rdci group -d 1

πŸ—‚οΈ Field Groups

Manage field groups to monitor specific GPU metrics.

Create a Field Group:

rdci fieldgroup -c <fgroup> -f 150,155

List Field Groups:

rdci fieldgroup -l

Delete a Field Group:

rdci fieldgroup -d 1

🛑 Monitor Errors

Define fields to monitor RAS ECC counters.

  • Correctable ECC Errors:

    312 RDC_FI_ECC_CORRECT_TOTAL
  • Uncorrectable ECC Errors:

    313 RDC_FI_ECC_UNCORRECT_TOTAL
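These numeric IDs plug directly into the rdci fieldgroup command shown earlier. A small illustrative sketch in plain Python (no RDC required); the group name ecc_group is hypothetical:

```python
# Field IDs for RAS ECC counters, as listed above.
ECC_FIELDS = {
    312: "RDC_FI_ECC_CORRECT_TOTAL",    # correctable ECC errors
    313: "RDC_FI_ECC_UNCORRECT_TOTAL",  # uncorrectable ECC errors
}

def fieldgroup_command(name, field_ids):
    """Compose an 'rdci fieldgroup' create command for the given field IDs."""
    ids = ",".join(str(i) for i in sorted(field_ids))
    return f"rdci fieldgroup -c {name} -f {ids}"

# "ecc_group" is an example name, not an RDC default.
print(fieldgroup_command("ecc_group", ECC_FIELDS))
# -> rdci fieldgroup -c ecc_group -f 312,313
```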

📈 Device Monitoring

Monitor GPU fields such as temperature, power usage, and utilization.

Command:

rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000

Sample Output:

1 group found

+-----------+-------------+---------------+
| GPU Index | TEMP (m°C)  | POWER (µW)    |
+-----------+-------------+---------------+
| 0         | 25000       | 520500        |
+-----------+-------------+---------------+
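As the column headers indicate, dmon reports temperature in millidegrees Celsius and power in microwatts, so readings usually need scaling before display. A stdlib-only sketch of the conversion, applied to the sample row above:

```python
def millidegrees_to_c(m_deg_c):
    """Convert millidegrees Celsius (as reported by dmon) to degrees Celsius."""
    return m_deg_c / 1000.0

def microwatts_to_w(u_watts):
    """Convert microwatts (as reported by dmon) to watts."""
    return u_watts / 1_000_000.0

# Sample row from the output above: GPU 0 at 25000 m°C and 520500 µW.
print(millidegrees_to_c(25000))  # -> 25.0
print(microwatts_to_w(520500))   # -> 0.5205
```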

📊 Job Stats

Display GPU statistics for any given workload.

Start Recording Stats:

rdci stats -s 2 -g 1

Stop Recording Stats:

rdci stats -x 2

Display Job Stats:

rdci stats -j 2

Sample Output:

Summary:
Executive Status:

Start time: 1586795401
End time: 1586795445
Total execution time: 44

Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
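The Max/Min/Avg lines above are plain aggregates over the samples collected between -s and -x. A sketch of the same bookkeeping; the per-second power samples here are illustrative values, not RDC output:

```python
def summarize(samples):
    """Return (max, min, avg) over numeric samples, mirroring the
    Max/Min/Avg lines in the job-stats summary."""
    if not samples:
        raise ValueError("no samples recorded")
    return max(samples), min(samples), sum(samples) / len(samples)

# Hypothetical power samples (watts), one per second of the job.
power = [13, 20, 49, 40, 48]
mx, mn, avg = summarize(power)
print(f"Power Usage (Watts): Max: {mx} Min: {mn} Avg: {avg:.0f}")
# -> Power Usage (Watts): Max: 49 Min: 13 Avg: 34
```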

🩺 Diagnostic

Run diagnostics on a GPU group to ensure system health.

Command:

rdci diag -g <gpu_group>

Sample Output:

No compute process:  Pass
Node topology check:  Pass
GPU parameters check:  Pass
Compute Queue ready:  Pass
System memory check:  Pass
=============== Diagnostic Details ==================
No compute process:  No processes running on any devices.
Node topology check:  No link detected.
GPU parameters check:  GPU 0 Critical Edge temperature in range.
Compute Queue ready:  Run binary search task on GPU 0 Pass.
System memory check:  Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.

🔌 Integration with Third-Party Tools

RDC integrates seamlessly with tools like Prometheus, Grafana, and Reliability, Availability, and Serviceability (RAS) to enhance monitoring and visualization.

🐍 Python Bindings

RDC provides a generic Python class RdcReader to simplify telemetry gathering.

Sample Program:

from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]

class SimpleRdcReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")

if __name__ == '__main__':
    reader = SimpleRdcReader()
    while True:
        time.sleep(1)
        reader.process()

Running the Example:

# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py

📈 Prometheus Plugin

The Prometheus plugin allows you to monitor events and send alerts.

Installation:

  1. Install Prometheus Client:

    pip install prometheus_client
  2. Run the Prometheus Plugin:

    python rdc_prometheus.py
  3. Verify Plugin:

    curl localhost:5000

Integration Steps:

  1. Download and Install Prometheus:

  2. Configure Prometheus Targets:

    • Modify prometheus_targets.json to point to your compute nodes.
    [
      {
        "targets": [
          "rdc_test1.amd.com:5000",
          "rdc_test2.amd.com:5000"
        ]
      }
    ]
  3. Start Prometheus with Configuration File:

    prometheus --config.file=/path/to/rdc_prometheus_example.yml
  4. Access Prometheus UI:
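Before starting Prometheus, it can help to confirm the targets file is valid JSON with the expected shape. A minimal stdlib check, assuming the layout shown in step 2:

```python
import json

def load_targets(text):
    """Parse a prometheus_targets.json document and return the flat
    list of host:port targets, validating the expected shape."""
    data = json.loads(text)
    targets = []
    for entry in data:
        if "targets" not in entry:
            raise ValueError("entry missing 'targets' key")
        targets.extend(entry["targets"])
    return targets

# The sample content from step 2 above.
sample = '''
[
  {
    "targets": [
      "rdc_test1.amd.com:5000",
      "rdc_test2.amd.com:5000"
    ]
  }
]
'''
print(load_targets(sample))
# -> ['rdc_test1.amd.com:5000', 'rdc_test2.amd.com:5000']
```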

📊 Grafana Integration

Grafana provides advanced visualization capabilities for RDC metrics.

Installation:

  1. Download Grafana:

  2. Install Grafana:

  3. Start Grafana Server:

    sudo systemctl start grafana-server
    sudo systemctl status grafana-server
  4. Access Grafana:

Configuration Steps:

  1. Add Prometheus Data Source:

    • Navigate to Configuration β†’ Data Sources β†’ Add data source β†’ Prometheus.
    • Set the URL to http://localhost:9090 and save.
  2. Import RDC Dashboard:

    • Click the + icon and select Import.
    • Upload rdc_grafana_dashboard_example.json from the python_binding folder.
    • Select the desired compute node for visualization.

πŸ›‘οΈ Reliability, Availability, and Serviceability (RAS) Plugin

The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.

Installation:

  1. Ensure GPU Supports RAS:

    • The GPU must support RAS features.
  2. RDC Installation Includes RAS Library:

    • librdc_ras.so is located in /opt/rocm-4.2.0/rdc/lib.

Usage:

  • Monitor ECC Errors:

    rdci dmon -i 0 -e 600,601

    Sample Output:

    GPU     ECC_CORRECT         ECC_UNCORRECT
    0       0                   0
    

🐞 Troubleshooting

Known Issues

🛑 dmon Fields Return N/A

  1. Missing Libraries:

    • Verify /opt/rocm/lib/rdc/librdc_*.so exists.
    • Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
  2. Unsupported GPU:

    • Most metrics work on MI300 and newer.
    • Limited metrics on MI200.
    • Consumer GPUs (e.g., RX6800) have fewer supported metrics.

🐍 dmon RocProfiler Fields Return Zeros

Solution:

Set the HSA_TOOLS_LIB environment variable before running a compute job.

export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1

Example:

# Terminal 1
rdcd -u

# Terminal 2
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
gpu-burn

# Terminal 3
rdci dmon -u -e 800,801 -i 0 -c 1

# Output:
GPU   OCCUPANCY_PERCENT   ACTIVE_WAVES
0     001.000             32640.000

⚠️ HSA_STATUS_ERROR_OUT_OF_RESOURCES

Error Message:

terminate called after throwing an instance of 'std::runtime_error'
 what():  hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Aborted (core dumped)

Solution:

  1. Missing Groups:

    • Ensure video and render groups exist.
    sudo usermod -aG video,render $USER
    • Log out and log back in to apply group changes.
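The membership check in step 1 can be scripted. This is a sketch using only the Python standard library (Unix-only), with a pure helper that is testable without touching the live system:

```python
import grp
import os

def missing_groups(user_groups, required=("video", "render")):
    """Return which of the required groups the user does not belong to."""
    return [g for g in required if g not in user_groups]

def current_user_groups():
    """Group names for the current process (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid with no /etc/group entry
    return names

if __name__ == "__main__":
    absent = missing_groups(current_user_groups())
    if absent:
        print("Missing groups:", ", ".join(absent))
        print("Fix: sudo usermod -aG " + ",".join(absent) + " $USER")
    else:
        print("video and render group membership OK")
```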

πŸ› Troubleshooting RDCD

  • View RDCD Logs:

    sudo journalctl -u rdc
  • Run RDCD with Debug Logs:

    RDC_LOG=DEBUG /opt/rocm/bin/rdcd
    • Logging Levels Supported: ERROR, INFO, DEBUG
  • Enable Additional Logging Messages:

    export RSMI_LOGGING=3

📄 License

RDC is open-source and available under the MIT License.


📧 Support

For support and further inquiries, please refer to the ROCm Documentation or contact the maintainers through the repository's issue tracker.