
Hybrid execution of sterf #865

Open · wants to merge 14 commits into develop
Conversation

@tfalders (Collaborator) commented Dec 4, 2024:

This is a revamp of #462 using the new hybrid infrastructure that was introduced for GESVD. Here's a brief summary of changes:

  • Added CPU execution path for STERF. Tests for STERF_HYBRID and SYEV_HYBRID have been added to verify correctness.
  • Rearranged functions in lib_device_helpers so that device functions are no longer present in the kernels section.
  • Moved is_device_pointer from a lambda function to a reusable function in lib_host_helpers.
  • Created a new class, rocsolver_hybrid_array, to assist with memory allocation and data transfers to and from the device. I have revised rocsolver_bdsqr_host_batch_template to use this new functionality.

@tfalders tfalders added the noOptimizations Disable optimized kernels for small sizes for some routines label Dec 4, 2024
@amd-jnovotny (Contributor) left a comment:

changelog ok

If uplo = rocblas_fill_upper, only the upper triangular part is copied
If uplo = rocblas_fill_lower, only the lower triangular part is copied **/
template <typename T, typename U, typename Mask = no_mask>
ROCSOLVER_KERNEL void copy_mat(copymat_direction direction,
Contributor:

Minor suggestion: if this is intended to be a frequently reused library routine, perhaps we might consider a more general version from (A,shiftA,lda,strideA) to (B,shiftB,ldb,strideB). The special case of a linear "buffer" would be shiftB = 0, strideB = m*n, ldb = m. With a more general interface, perhaps we don't need the extra enum for "direction"; we would just reverse the order of the "A" and "B" arguments.

Perhaps consider a more suggestive name for Mask. It is not immediately obvious to me whether we perform the copy if the mask is non-zero, or skip the copy if the mask is non-zero.

Is there a convention to place action arguments before the data? In TRSM, for example, we place the uplo, side, trans, and diag arguments before the data, instead of placing diag or uplo at the end of the argument list.

@tfalders (Collaborator, author):

IIRC, this function has multiple overloads where B has fewer associated parameters than A. In these cases, specifying a direction is necessary since we can't easily swap A and B. It's not necessary for this particular overload, but I think we kept it for consistency.

I don't entirely remember how the mask works either, so we should definitely improve documentation at a minimum.

Conventionally, yes, those arguments are placed at the start, but when they have default values they can't be put before any arguments without default values. I seem to remember they were tacked on after the fact, and assigned default values to avoid breaking existing calls.

T* Ap = load_ptr_batch<T>(A, b, shiftA, strideA);
T* Bp = &buffer[b * strideB];

if(direction == copymat_to_buffer)
Contributor:

I wonder whether this library routine can be further generalized to add a "trans" option to perform conjugate transpose, transpose, or none. MAGMA BLAS has a routine to perform efficient matrix transpose. Just a suggestion.

@tfalders (Collaborator, author):

We have a copy_trans_mat function that does have this functionality.

ROCSOLVER_BEGIN_NAMESPACE

template <typename T, typename I, typename U>
struct rocsolver_hybrid_array
Contributor:

It would be nice to have more documentation describing not the internal implementation details but the interface: what it is intended to do and what methods are available. Is it keeping a data structure (strided array) on the CPU and mirroring it on the GPU device, or vice versa? If so, perhaps the routines are trying to "sync"? Should there be "async" and "noasync" versions, since the "_async" suffix suggests a non-async version exists? Are there calls to synchronize the stream in the "_async" routines? If these are not truly "async" routines, perhaps just leave out "_async"? I also wonder whether std::vector could be leveraged so there is less worry about memory leaks.

@tfalders (Collaborator, author):

Yeah, the documentation is rather sparse at the moment. I can take another look and try to improve it.

I added the _async suffix mostly to indicate to anyone using the function that a hipStreamSynchronize needs to be done before using the results of the function. I could have synchronized within the functions themselves, but I wanted to give us the option of queuing a number of hipMemcpys that can be sync'ed all at once for better performance.

@@ -103,11 +105,19 @@ class SYEV_HEEV : public ::TestWithParam<syev_heev_tuple>
}
};

class SYEV : public SYEV_HEEV
class SYEV : public SYEV_HEEV<0>
Contributor:

Minor: is there a convention to use all upper case for compile-time constants or #define macros? If so, perhaps consider using mixed case here? Just a thought.

@tfalders (Collaborator, author):

I believe that changing the capitalization of the test class will also change the capitalization of the test suite output. I like the all caps text in the test output as it makes it very easy to pick out the function name.

if(batch_array && (val_array || this->dim < 0))
free(batch_array);
}

Collaborator:

Since this class owns malloc-ed memory it might be a good idea to delete its implicit copy constructor and assignment operator to guard against a double free.

Collaborator:

It would be nicer to follow the rule of five and also explicitly default or delete the move constructor and the move assignment operator (those are implicitly deleted here, but you could also default them without any issues if that fits how the struct is meant to be used).

Contributor:

I wonder whether using thrust::host_vector and thrust::device_vector would potentially simplify the code. This is in the spirit of "eating our own dog food" or "drinking our own champagne" in re-using AMD rocThrust software. Just a thought.

@jmachado-amd (Collaborator) left a comment:

It looks good, thanks @tfalders!

The rocsolver_hybrid_array struct complements the implementation of the current hybrid methods well. I've just included a few comments there, let me know if you want me to clarify anything.

library/src/include/rocsolver_hybrid_array.hpp: review threads resolved (outdated)

/* Used to read device pointers from a batched array for use on the host; no other data is read from the
device. */
rocblas_status init_pointers_only(U array, rocblas_stride stride, I batch_count, hipStream_t stream)
Collaborator:

Minor comment (code design): the behaviour of this struct when initialized with init_pointers_only is so different from the behaviour yielded by init_async that it may make sense to break it apart into two different structs. Maybe something to be pondered for the future?

Comment on lines 53 to 58
I dim, batch_count;
rocblas_stride stride;

U src_array;
T** batch_array;
T* val_array;
Collaborator:

Minor question (code design): what would be the use case in which we would want to access those members directly?

@tfalders (Collaborator, author):

I don't believe there is a use case. Would you like me to mark them as private?

Collaborator:

I pointed it out to start the conversation, I think that we will know better than to try and change those by hand; but it is always safer to constrain degrees of freedom that are not meant to be used.

Whether to keep or change the access type of those is a decision that I defer to you, I trust your judgement either way.

if(!val_array)
return rocblas_status_memory_error;
#else
if(posix_memalign((void**)&val_array, sizeof(void*), val_bytes) != 0)
Collaborator:

Small comment here: an alignment of sizeof(void*) is no stronger than what malloc already guarantees, so this is not really different from a plain malloc; more typical alignments would be 32 or 64. But I see the point in leaving this as is and deciding the final value after benchmarking.

@jmachado-amd (Collaborator) left a comment:

It looks good to me, thanks again @tfalders! I left a couple of comments but nothing that requires any update at this point in time -- unless you want to update anything, of course.


auto istat = hipPointerGetAttributes(&dev_attributes, ptr);
if(istat != hipSuccess)
fmt::print(stderr, "is_device_pointer: istat = {} {}\n", istat, hipGetErrorName(istat));
Collaborator:

I've just noticed that this line is not compiling on Windows, likely because CI is using a version of libfmt newer than 10.0.0. Among many other things, that version deprecated the implicit conversion of enums, so the simplest fix would be to update this line to something like:

fmt::print(stderr, "is_device_pointer: istat = {} {}\n", static_cast<std::int32_t>(istat), hipGetErrorName(istat));

Their documentation provides other options, but we should consider updating rocSOLVER to make those errors easier to catch going forward (of course, not in this PR). For the time being, please make sure to cast all inputs of fmt::print into basic types, or types defined in the std namespace.
