Stop using boston housing dataset #6180
base: branch-25.02
Instead we use the California housing dataset.
3e090c2 to 0112051
Looks like there are some failures in https://github.com/rapidsai/cuml/actions/runs/12297823749/job/34322294951?pr=6180#step:8:8331
@betatim can you open an issue with the log of the error? I think we should use this PR to debug that failure. Specifically, the test fails with this set of relevant parameters:
This combination of parameters should give us a clue when debugging. Important that it is with
At first glance, this suggests to me a bug either in the
cuml/cpp/src/hdbscan/detail/select.cuh, line 269 in 7211507
Something else that always frustrates me about assert in pytest: I'd love to know beforehand the results of the following assertions (calculating
Might be useful to adopt that approach in general; I will open an issue to capture that as well.
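One way to see the result of every comparison before pytest stops at the first failing assert is to compute all checks up front and assert once at the end. A minimal sketch using only the standard library; the helper name `collect_failures`, the check descriptions, and the cluster sizes are hypothetical and not taken from cuML's test suite:

```python
def collect_failures(checks):
    """Return the descriptions of all checks whose result is False.

    `checks` maps a human-readable description to an already-computed boolean.
    """
    return [name for name, passed in checks.items() if not passed]


# Hypothetical per-cluster sizes from a reference and a tested implementation.
expected_sizes = [120, 45, 30]
actual_sizes = [120, 44, 30, 1]

# Evaluate every comparison first, so none is hidden by an earlier failure.
checks = {
    "same number of clusters": len(actual_sizes) == len(expected_sizes),
    "same largest cluster": max(actual_sizes) == max(expected_sizes),
    "same total samples": sum(actual_sizes) == sum(expected_sizes),
}

failures = collect_failures(checks)
# In a test this would become: assert not failures, f"failed checks: {failures}"
print(failures)
```

With this shape, a single failing run reports every check that did not hold instead of only the first one.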
An interesting thing that I didn't spot initially is that the lengths of the arrays are different. So it isn't just the number of samples assigned to each cluster that differs, but also the total number of clusters (I think). Poking around a bit to understand what is what.
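To make that distinction concrete, here is a small sketch that turns two label arrays into sorted cluster sizes, so a difference in length (number of clusters) is visible separately from a difference in counts. The label values are made up for illustration, and treating `-1` as noise is an assumption based on the usual HDBSCAN convention:

```python
from collections import Counter


def cluster_sizes(labels):
    """Return cluster sizes in descending order, ignoring the noise label -1."""
    counts = Counter(label for label in labels if label != -1)
    return sorted(counts.values(), reverse=True)


# Hypothetical labels from two implementations on the same six samples.
labels_a = [0, 0, 0, 1, 1, -1]
labels_b = [0, 0, 1, 1, 2, -1]

sizes_a = cluster_sizes(labels_a)
sizes_b = cluster_sizes(labels_b)

# Different lengths mean a different number of clusters, not just different counts.
print(len(sizes_a), sizes_a)
print(len(sizes_b), sizes_b)
```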
Here's a full sequence of
where the
NOTE: I can also verify with an eye-test that almost all data points in the mutual reachability graph seem to have similar edge weight patterns.
The takeaway here is that we can't make the current test pass, and that is a feature (not a bug)? Because cuml uses a parallel MST structure, we can't promise to break ties in the same way as a sequential implementation. If yes, I see two (+1) options:
WDYT?
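The tie-breaking point can be illustrated with a toy example: when several edges share the same weight, the order in which they are considered decides which minimum spanning tree is built, even though the total weight is identical. This is a plain sequential Kruskal sketch written for illustration only; it is not cuML's MST implementation, and the graph is made up:

```python
def kruskal(n_nodes, edges):
    """Build a minimum spanning tree from (u, v, weight) edges.

    Python's sort is stable, so edges with equal weight are visited in input order.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# A triangle where every edge has the same weight, listed in two different orders.
order_a = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
order_b = [(0, 2, 1.0), (1, 2, 1.0), (0, 1, 1.0)]

tree_a = kruskal(3, order_a)
tree_b = kruskal(3, order_b)

print(tree_a, sum(w for _, _, w in tree_a))
print(tree_b, sum(w for _, _, w in tree_b))
```

Both trees are valid minimum spanning trees, but they contain different edges, so anything derived from the tree structure downstream can differ as well.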
@betatim I would want to try a dataset with fewer repetitive patterns. The California housing dataset has a bunch of repetitive values in the first 3 columns.
I don't see the repetitiveness?
A lot of 52-year-old houses, but otherwise this doesn't look too bad? Do you have an alternative dataset in mind? And for my education, why would having repetitive values be a problem? Real-world data might have this as well, no?
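One rough way to put a number on "repetitive" is the share of rows taken by the most common value in each column. A minimal sketch on made-up rows; the helper name `most_common_fraction` and the values are hypothetical and are not the actual California housing data:

```python
from collections import Counter


def most_common_fraction(rows):
    """For each column, return the fraction of rows holding its most common value."""
    n_rows = len(rows)
    fractions = []
    for column in zip(*rows):
        _, count = Counter(column).most_common(1)[0]
        fractions.append(count / n_rows)
    return fractions


# Made-up rows: the first column repeats one value, the others are all distinct.
rows = [
    (52.0, 1.5, 3.0),
    (52.0, 2.5, 3.1),
    (52.0, 3.5, 3.2),
    (41.0, 4.5, 3.3),
]

fractions = most_common_fraction(rows)
print(fractions)
```

For the real dataset, the rows could presumably come from `sklearn.datasets.fetch_california_housing().data`; a column with a high fraction has many tied values, which is the situation where tie-breaking order starts to matter.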
Instead we use the California housing dataset.
closes #5158