Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute noise_variance_ in PCA implementation #6234

Open
wants to merge 2 commits into
base: branch-25.02
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 17 additions & 2 deletions cpp/src/pca/pca.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2018-2024, NVIDIA CORPORATION.
* Copyright (c) 2018-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -41,6 +41,7 @@ void truncCompExpVars(const raft::handle_t& handle,
math_t* components,
math_t* explained_var,
math_t* explained_var_ratio,
math_t* noise_vars,
const paramsTSVDTemplate<enum_solver>& prms,
cudaStream_t stream)
{
Expand All @@ -67,6 +68,20 @@ void truncCompExpVars(const raft::handle_t& handle,
prms.n_components,
std::size_t(1),
stream);

// Compute the scalar noise_vars defined as (pseudocode)
// (n_components < min(n_cols, n_rows)) ? explained_var_all[n_components:].mean() : 0
if (prms.n_components < prms.n_cols && prms.n_components < prms.n_rows) {
raft::stats::mean(noise_vars,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the newer mdspan-API of this mean function as well?

Copy link
Member Author

@jcrist jcrist Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my understanding that since the PCA code here doesn't create streams from the handle (#2470), and is mostly using the legacy APIs here that take streams instead of handles, that switching to the mdspan APIs instead would require a substantial refactor. I'm happy to take that on in a follow-up, but would prefer to port all the code to the newer APIs at the same time, rather than try to force it in as part of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's a valid point, let's keep that port for an other dedicated PR.

explained_var_all.data() + prms.n_components,
std::size_t{1},
prms.n_cols - prms.n_components,
false,
true,
stream);
} else {
raft::matrix::setValue(noise_vars, noise_vars, math_t{0}, 1, stream);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noise_vars is a len-1 device array that I want to set to 0. This seems to work (and is used at least one other place in cuml already), but I'm honestly not sure if it's the best spelling of that. In particular, I don't understand why a method for filling an output array with a single value also takes in an input array (which is the same as the output array in all calling locations I can find).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use raft::matrix::fill from raft/include/matrix/init.cuh Line 68. It is the more up-to-date function. You would need to create a mdspan to use it as such:

auto noise_vars_span = raft::make_device_vector_view(noise_vars, 1);

}
}

/**
Expand Down Expand Up @@ -116,7 +131,7 @@ void pcaFit(const raft::handle_t& handle,
raft::stats::cov(
handle, cov.data(), input, mu, prms.n_cols, prms.n_rows, true, false, true, stream);
truncCompExpVars(
handle, cov.data(), components, explained_var, explained_var_ratio, prms, stream);
handle, cov.data(), components, explained_var, explained_var_ratio, noise_vars, prms, stream);

math_t scalar = (prms.n_rows - 1);
raft::matrix::seqRoot(explained_var, singular_vals, scalar, n_components, stream, true);
Expand Down
21 changes: 16 additions & 5 deletions cpp/src/pca/pca_mg.cu
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
* Copyright (c) 2019-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -69,7 +69,7 @@ void fit_impl(raft::handle_t& handle,
Stats::opg::cov(handle, cov, input_data, input_desc, mu_data, true, streams, n_streams);

ML::truncCompExpVars<T, mg_solver>(
handle, cov.ptr, components, explained_var, explained_var_ratio, prms, streams[0]);
handle, cov.ptr, components, explained_var, explained_var_ratio, noise_vars, prms, streams[0]);

T scalar = (prms.n_rows - 1);
raft::matrix::seqRoot(explained_var, singular_vals, scalar, prms.n_components, streams[0], true);
Expand Down Expand Up @@ -128,9 +128,6 @@ void fit_impl(raft::handle_t& handle,
streams,
n_streams,
verbose);
for (std::uint32_t i = 0; i < n_streams; i++) {
handle.sync_stream(streams[i]);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sync point was unneeded, it was already handled after exiting this branch below.

} else if (prms.algorithm == mg_solver::QR) {
const raft::handle_t& h = handle;
cudaStream_t stream = h.get_stream();
Expand Down Expand Up @@ -194,6 +191,20 @@ void fit_impl(raft::handle_t& handle,
std::size_t(1),
stream);

// Compute the scalar noise_vars defined as (pseudocode)
// (n_components < min(n_cols, n_rows)) ? explained_var_all[n_components:].mean() : 0
if (prms.n_components < prms.n_cols && prms.n_components < prms.n_rows) {
raft::stats::mean(noise_vars,
explained_var_all.data() + prms.n_components,
std::size_t{1},
prms.n_cols - prms.n_components,
false,
true,
stream);
} else {
raft::matrix::setValue(noise_vars, noise_vars, T{0}, 1, stream);
}

raft::linalg::transpose(vMatrix.data(), prms.n_cols, stream);
raft::matrix::truncZeroOrigin(
vMatrix.data(), prms.n_cols, components, prms.n_components, prms.n_cols, stream);
Expand Down
6 changes: 3 additions & 3 deletions python/cuml/cuml/decomposition/pca.pyx
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2019-2024, NVIDIA CORPORATION.
# Copyright (c) 2019-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -171,8 +171,8 @@ class PCA(UniversalBase,
1 2.333...
2 2.333...
dtype: float32
>>> print(f'noise variance: {pca_float.noise_variance_}')
noise variance: 0 0.0
>>> print(f'noise variance: {pca_float.noise_variance_}') # doctest: +SKIP
noise variance: 0 -7.377287e-08
dtype: float32
>>> trans_gdf_float = pca_float.transform(gdf_float)
>>> print(f'Inverse: {trans_gdf_float}') # doctest: +SKIP
Expand Down
20 changes: 18 additions & 2 deletions python/cuml/cuml/tests/test_pca.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2019-2023, NVIDIA CORPORATION.
# Copyright (c) 2019-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -72,13 +72,13 @@ def test_pca_fit(datatype, input_type, name, use_handle):
"components_",
"explained_variance_",
"explained_variance_ratio_",
"noise_variance_",
]:
with_sign = False if attr in ["components_"] else True
print(attr)
print(getattr(cupca, attr))
print(getattr(skpca, attr))
cuml_res = getattr(cupca, attr)

skl_res = getattr(skpca, attr)
assert array_equal(cuml_res, skl_res, 1e-3, with_sign=with_sign)

Expand Down Expand Up @@ -304,6 +304,22 @@ def test_sparse_pca_inputs(nrows, ncols, whiten, return_sparse, cupy_input):
assert array_equal(i_sparse, X.todense(), 1e-1, with_sign=True)


@pytest.mark.parametrize(
"n_samples, n_features",
[
pytest.param(9, 20, id="n_samples <= n_components"),
pytest.param(20, 10, id="n_features <= n_components"),
],
)
def test_noise_variance_zero(n_samples, n_features):
X, _ = make_blobs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a test other than zero be added? To check the correctness of the computation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is checking the case where the noise variance is defined as zero since there weren't enough samples or features. Other cases where it's non zero are already tested thoroughly above on line 75.

n_samples=n_samples, n_features=n_features, random_state=0
)
cupca = cuPCA(n_components=10)
cupca.fit(X)
assert cupca.noise_variance_.item() == 0


def test_exceptions():
with pytest.raises(NotFittedError):
X = cp.random.random((10, 10))
Expand Down
Loading