
ROCm™ Data Center Tool (RDC) 🚀

The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges for AMD GPUs in cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.

🌟 Main Features

  • GPU Telemetry 📊
  • GPU Statistics for Jobs 📈
  • Integration with Third-Party Tools 🔗
  • Open Source 🛠️

For comprehensive documentation and to get started with RDC using pre-built packages, refer to the ROCm Data Center Tool User Guide.


πŸ› οΈ Installation Guide

πŸ“‹ Prerequisites

Before setting up RDC, ensure your system meets the following requirements:

  • Supported Platforms: RDC runs on AMD ROCm-supported platforms. Refer to the List of Supported Operating Systems for details.
  • Dependencies:
    • CMake β‰₯ 3.15
    • g++ (5.4.0)
    • Doxygen (1.8.11)
    • LaTeX (pdfTeX 3.14159265-2.6-1.40.16)
    • gRPC and protoc
    • libcap-dev
    • AMD ROCm Platform (GitHub)

πŸ” Certificate Generation

For certificate generation, refer to the RDC Developer Handbook (Generate Files for Authentication) or consult the concise guide located at authentication/readme.txt.


🚀 Running RDC

RDC supports two primary modes of operation: Standalone and Embedded. Choose the mode that best fits your deployment needs.

πŸ—‚οΈ Standalone Mode

Standalone mode allows RDC to run independently with all its components installed.

  1. Start RDCD with Authentication (Monitor-Only Capabilities):

    /opt/rocm/bin/rdcd
  2. Start RDCD with Authentication (Full Capabilities):

    sudo /opt/rocm/bin/rdcd
  3. Start RDCD without Authentication (Monitor-Only):

    /opt/rocm/bin/rdcd -u
  4. Start RDCD without Authentication (Full Capabilities):

    sudo /opt/rocm/bin/rdcd -u

🔗 Embedded Mode

Embedded mode integrates RDC directly into your existing management tools as a shared library.

  • Run RDC in Embedded Mode:

    python your_management_tool.py --rdc_embedded

Note: Ensure that the rdcd daemon is not running separately when using embedded mode.

πŸ› οΈ Starting RDCD Using systemd

  1. Copy the Service File:

    sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
  2. Configure Capabilities:

    • Full Capabilities: Ensure the following lines are uncommented in /etc/systemd/system/rdc.service:

      CapabilityBoundingSet=CAP_DAC_OVERRIDE
      AmbientCapabilities=CAP_DAC_OVERRIDE
    • Monitor-Only Capabilities: Comment out the above lines to restrict RDCD to monitoring.

  3. Start the Service:

    sudo systemctl start rdc
    sudo systemctl status rdc
  4. Modify RDCD Options:

    Edit /opt/rocm/share/rdc/conf/rdc_options.conf to append any additional RDCD parameters.

    sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf

    Example Configuration:

    RDC_OPTS="-p 50051 -u -d"
    • Flags:
      • -p 50051 : Use port 50051
      • -u : Unauthenticated mode
      • -d : Enable debug messages

πŸ—οΈ Building RDC from Source

If you prefer to build RDC from source, follow the steps below.

🔧 Building gRPC and protoc

Important: RDC requires gRPC and protoc to be built from source, as pre-built packages are not available.

  1. Install Required Tools:

    sudo apt-get update
    sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
  2. Clone and Build gRPC:

    git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
    cd grpc
    export GRPC_ROOT=/opt/grpc
    cmake -B build \
        -DgRPC_INSTALL=ON \
        -DgRPC_BUILD_TESTS=OFF \
        -DBUILD_SHARED_LIBS=ON \
        -DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,'$ORIGIN' \
        -DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_BUILD_TYPE=Release
    make -C build -j $(nproc)
    sudo make -C build install
    echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
    sudo ldconfig
    cd ..

🔧 Building RDC

  1. Clone the RDC Repository:

    git clone https://github.com/ROCm/rdc
    cd rdc
  2. Configure the Build:

    cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
    • Optional Features:
      • Enable ROCm Profiler:

        cmake -B build -DBUILD_PROFILER=ON
      • Enable RVS:

        cmake -B build -DBUILD_RVS=ON
      • Build RDC Library Only (without rdci and rdcd):

        cmake -B build -DBUILD_STANDALONE=OFF
      • Build RDC Library Without ROCm Run-time:

        cmake -B build -DBUILD_RUNTIME=OFF
  3. Build and Install:

    make -C build -j $(nproc)
    sudo make -C build install
  4. Update System Library Path:

    export RDC_LIB_DIR=/opt/rocm/lib/rdc
    export GRPC_LIB_DIR="/opt/grpc/lib"
    echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
    echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
    sudo ldconfig
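To confirm the loader can now resolve the client library, the lookup can be scripted. This is a minimal sketch using only the Python standard library; the library name rdc_client is inferred from the librdc_client file name above and is an assumption, not an official API:

```python
# Sanity-check that the dynamic loader can resolve shared libraries
# after the ldconfig steps above. find_library returns the soname if
# the library is found, else None. The name "rdc_client" is inferred
# from the librdc_client file name and is an assumption.
from ctypes.util import find_library

def check_libs(names=("rdc_client",)):
    """Return {library name: soname or None} for each requested name."""
    return {name: find_library(name) for name in names}

if __name__ == "__main__":
    for name, soname in check_libs().items():
        print(name, ":", soname if soname else "NOT FOUND (run: sudo ldconfig)")
```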

📊 Features Overview

🔍 Discovery

Locate and display information about GPUs present in a compute node.

Example:

rdci discovery <host_name> -l

Output:

2 GPUs found

+-----------+----------------------------------------------+
| GPU Index | Device Information                           |
+-----------+----------------------------------------------+
| 0         | Name: AMD Radeon Instinct MI50 Accelerator   |
| 1         | Name: AMD Radeon Instinct MI50 Accelerator   |
+-----------+----------------------------------------------+

👥 Groups

🖥️ GPU Groups

Create, delete, and list logical groups of GPUs.

Create a Group:

rdci group -c GPU_GROUP

Add GPUs to Group:

rdci group -g 1 -a 0,1

List Groups:

rdci group -l

Delete a Group:

rdci group -d 1

πŸ—‚οΈ Field Groups

Manage field groups to monitor specific GPU metrics.

Create a Field Group:

rdci fieldgroup -c <fgroup> -f 150,155

List Field Groups:

rdci fieldgroup -l

Delete a Field Group:

rdci fieldgroup -d 1

🛑 Monitor Errors

Define fields to monitor RAS ECC counters.

  • Correctable ECC Errors:

    312 RDC_FI_ECC_CORRECT_TOTAL
  • Uncorrectable ECC Errors:

    313 RDC_FI_ECC_UNCORRECT_TOTAL
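These numeric IDs plug directly into the rdci fieldgroup command shown earlier. A small illustrative sketch in plain Python (no RDC required); the group name ecc_group is hypothetical:

```python
# Field IDs for RAS ECC counters, as listed above.
ECC_FIELDS = {
    312: "RDC_FI_ECC_CORRECT_TOTAL",    # correctable ECC errors
    313: "RDC_FI_ECC_UNCORRECT_TOTAL",  # uncorrectable ECC errors
}

def fieldgroup_command(name, field_ids):
    """Compose an 'rdci fieldgroup' create command for the given field IDs."""
    ids = ",".join(str(i) for i in sorted(field_ids))
    return f"rdci fieldgroup -c {name} -f {ids}"

# "ecc_group" is an example name, not an RDC default.
print(fieldgroup_command("ecc_group", ECC_FIELDS))
# -> rdci fieldgroup -c ecc_group -f 312,313
```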

📈 Device Monitoring

Monitor GPU fields such as temperature, power usage, and utilization.

Command:

rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000

Sample Output:

1 group found

+-----------+-------------+---------------+
| GPU Index | TEMP (m°C)  | POWER (µW)    |
+-----------+-------------+---------------+
| 0         | 25000       | 520500        |
+-----------+-------------+---------------+
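As the column headers indicate, dmon reports temperature in millidegrees Celsius and power in microwatts, so readings usually need scaling before display. A stdlib-only sketch of the conversion, applied to the sample row above:

```python
def millidegrees_to_c(m_deg_c):
    """Convert millidegrees Celsius (as reported by dmon) to degrees Celsius."""
    return m_deg_c / 1000.0

def microwatts_to_w(u_watts):
    """Convert microwatts (as reported by dmon) to watts."""
    return u_watts / 1_000_000.0

# Sample row from the output above: GPU 0 at 25000 m°C and 520500 µW.
print(millidegrees_to_c(25000))  # -> 25.0
print(microwatts_to_w(520500))   # -> 0.5205
```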

📊 Job Stats

Display GPU statistics for any given workload.

Start Recording Stats:

rdci stats -s 2 -g 1

Stop Recording Stats:

rdci stats -x 2

Display Job Stats:

rdci stats -j 2

Sample Output:

Summary:
Executive Status:

Start time: 1586795401
End time: 1586795445
Total execution time: 44

Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
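The Max/Min/Avg lines above are plain aggregates over the samples collected between -s and -x. A sketch of the same bookkeeping; the per-second power samples here are illustrative values, not RDC output:

```python
def summarize(samples):
    """Return (max, min, avg) over numeric samples, mirroring the
    Max/Min/Avg lines in the job-stats summary."""
    if not samples:
        raise ValueError("no samples recorded")
    return max(samples), min(samples), sum(samples) / len(samples)

# Hypothetical power samples (watts), one per second of the job.
power = [13, 20, 49, 40, 48]
mx, mn, avg = summarize(power)
print(f"Power Usage (Watts): Max: {mx} Min: {mn} Avg: {avg:.0f}")
# -> Power Usage (Watts): Max: 49 Min: 13 Avg: 34
```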

🩺 Diagnostic

Run diagnostics on a GPU group to ensure system health.

Command:

rdci diag -g <gpu_group>

Sample Output:

No compute process:  Pass
Node topology check:  Pass
GPU parameters check:  Pass
Compute Queue ready:  Pass
System memory check:  Pass
=============== Diagnostic Details ==================
No compute process:  No processes running on any devices.
Node topology check:  No link detected.
GPU parameters check:  GPU 0 Critical Edge temperature in range.
Compute Queue ready:  Run binary search task on GPU 0 Pass.
System memory check:  Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.

🔌 Integration with Third-Party Tools

RDC integrates seamlessly with tools like Prometheus, Grafana, and Reliability, Availability, and Serviceability (RAS) to enhance monitoring and visualization.

🐍 Python Bindings

RDC provides a generic Python class RdcReader to simplify telemetry gathering.

Sample Program:

from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]

class SimpleRdcReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")

if __name__ == '__main__':
    reader = SimpleRdcReader()
    while True:
        time.sleep(1)
        reader.process()

Running the Example:

# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py

📈 Prometheus Plugin

The Prometheus plugin allows you to monitor events and send alerts.

Installation:

  1. Install Prometheus Client:

    pip install prometheus_client
  2. Run the Prometheus Plugin:

    python rdc_prometheus.py
  3. Verify Plugin:

    curl localhost:5000

Integration Steps:

  1. Download and Install Prometheus:

  2. Configure Prometheus Targets:

    • Modify prometheus_targets.json to point to your compute nodes.
    [
      {
        "targets": [
          "rdc_test1.amd.com:5000",
          "rdc_test2.amd.com:5000"
        ]
      }
    ]
  3. Start Prometheus with Configuration File:

    prometheus --config.file=/path/to/rdc_prometheus_example.yml
  4. Access Prometheus UI:
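Before starting Prometheus, it can help to confirm the targets file is valid JSON with the expected shape. A minimal stdlib check, assuming the layout shown in step 2:

```python
import json

def load_targets(text):
    """Parse a prometheus_targets.json document and return the flat
    list of host:port targets, validating the expected shape."""
    data = json.loads(text)
    targets = []
    for entry in data:
        if "targets" not in entry:
            raise ValueError("entry missing 'targets' key")
        targets.extend(entry["targets"])
    return targets

# The sample content from step 2 above.
sample = '''
[
  {
    "targets": [
      "rdc_test1.amd.com:5000",
      "rdc_test2.amd.com:5000"
    ]
  }
]
'''
print(load_targets(sample))
# -> ['rdc_test1.amd.com:5000', 'rdc_test2.amd.com:5000']
```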

📊 Grafana Integration

Grafana provides advanced visualization capabilities for RDC metrics.

Installation:

  1. Download Grafana:

  2. Install Grafana:

  3. Start Grafana Server:

    sudo systemctl start grafana-server
    sudo systemctl status grafana-server
  4. Access Grafana:

Configuration Steps:

  1. Add Prometheus Data Source:

    • Navigate to Configuration β†’ Data Sources β†’ Add data source β†’ Prometheus.
    • Set the URL to http://localhost:9090 and save.
  2. Import RDC Dashboard:

    • Click the + icon and select Import.
    • Upload rdc_grafana_dashboard_example.json from the python_binding folder.
    • Select the desired compute node for visualization.

πŸ›‘οΈ Reliability, Availability, and Serviceability (RAS) Plugin

The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.

Installation:

  1. Ensure GPU Supports RAS:

    • The GPU must support RAS features.
  2. RDC Installation Includes RAS Library:

    • librdc_ras.so is located in /opt/rocm-4.2.0/rdc/lib.

Usage:

  • Monitor ECC Errors:

    rdci dmon -i 0 -e 600,601

    Sample Output:

    GPU     ECC_CORRECT         ECC_UNCORRECT
    0       0                   0
    

🐞 Troubleshooting

Known Issues

🛑 dmon Fields Return N/A

  1. Missing Libraries:

    • Verify /opt/rocm/lib/rdc/librdc_*.so exists.
    • Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
  2. Unsupported GPU:

    • Most metrics work on MI300 and newer.
    • Limited metrics on MI200.
    • Consumer GPUs (e.g., RX6800) have fewer supported metrics.

🐍 dmon RocProfiler Fields Return Zeros

Solution:

Set the HSA_TOOLS_LIB environment variable before running a compute job.

export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1

Example:

# Terminal 1
rdcd -u

# Terminal 2
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
gpu-burn

# Terminal 3
rdci dmon -u -e 800,801 -i 0 -c 1

# Output:
GPU   OCCUPANCY_PERCENT   ACTIVE_WAVES
0     001.000             32640.000

⚠️ HSA_STATUS_ERROR_OUT_OF_RESOURCES

Error Message:

terminate called after throwing an instance of 'std::runtime_error'
 what():  hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Aborted (core dumped)

Solution:

  1. Missing Groups:

    • Ensure video and render groups exist.
    sudo usermod -aG video,render $USER
    • Log out and log back in to apply group changes.
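The membership check in step 1 can be scripted. This is a sketch using only the Python standard library (Unix-only), with a pure helper that is testable without touching the live system:

```python
import grp
import os

def missing_groups(user_groups, required=("video", "render")):
    """Return which of the required groups the user does not belong to."""
    return [g for g in required if g not in user_groups]

def current_user_groups():
    """Group names for the current process (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid with no /etc/group entry
    return names

if __name__ == "__main__":
    absent = missing_groups(current_user_groups())
    if absent:
        print("Missing groups:", ", ".join(absent))
        print("Fix: sudo usermod -aG " + ",".join(absent) + " $USER")
    else:
        print("video and render group membership OK")
```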

πŸ› Troubleshooting RDCD

  • View RDCD Logs:

    sudo journalctl -u rdc
  • Run RDCD with Debug Logs:

    RDC_LOG=DEBUG /opt/rocm/bin/rdcd
    • Logging Levels Supported: ERROR, INFO, DEBUG
  • Enable Additional Logging Messages:

    export RSMI_LOGGING=3

📄 License

RDC is open-source and available under the MIT License.


📧 Support

For support and further inquiries, please refer to the ROCm Documentation or contact the maintainers through the repository's issue tracker.