Platform specific largest floating point type #214
Hello 👋 This library helped me with my uni coursework and I wanted to fix two issues I ran into.
Summary
Certain metrics have the `torch.float64` type hardcoded, but not all backends support it, MPS in particular. The device the metric is meant to reside on is now used to determine the largest available floating-point type: either `torch.float64` or `torch.float32`.
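For illustration, a minimal sketch of the idea behind `largest_float` (an assumption on my part: the real helper may probe the device at runtime rather than special-case backend names):

```python
import torch

def largest_float(device: torch.device) -> torch.dtype:
    """Return the widest floating-point dtype usable on `device` (sketch)."""
    # MPS cannot allocate float64 tensors; the other backends mentioned
    # in this PR (CPU, CUDA, XLA) accept them.
    if device.type == "mps":
        return torch.float32
    return torch.float64

largest_float(torch.device("cpu"))  # torch.float64
largest_float(torch.device("mps"))  # torch.float32
```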
Certain metrics use the deprecated `Tensor.scatter_(dim, index, src, reduce="add")`, which I've replaced with the equivalent non-deprecated `Tensor.scatter_add_(dim, index, src)`.
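The two calls accumulate values identically, so the swap is behaviour-preserving; a quick sketch:

```python
import torch

index = torch.tensor([0, 1, 0, 2])
src = torch.ones(4)

# Deprecated form: scatter_ with reduce="add" emits a deprecation warning.
old = torch.zeros(3).scatter_(0, index, src, reduce="add")

# Non-deprecated equivalent: out[index[i]] += src[i] for each i.
new = torch.zeros(3).scatter_add_(0, index, src)

assert torch.equal(old, new)  # both are tensor([2., 1., 1.])
```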
I've made small fixes along the way in the form of correcting docstring typos, removing duplicated code, and fixing a minor bug in the generated documentation.
Test plan
Environment: `PYTORCH_ENABLE_MPS_FALLBACK=1`, set as recommended after initially running without it.

The main change is platform-specific and I don't have all the supported backends at my disposal, but I've tested `largest_float` with `torch.device("cpu")` and `torch.device("mps")` on my machine, and with `torch.device("xla")` and `torch.device("cuda")` in Colab.

I've otherwise run the entire test suite on my machine and passed all but nine tests.
6 failures from `tests/metrics/test_synclib.py`

`SynclibTest::test_sync_dtype_and_shape` fails because MPS doesn't support `torch.float64`. The other 5 failing tests are:

- `test_tensor_sync_states`
- `test_tensor_list_sync_states`
- `test_tensor_dict_sync_states`
- `test_complex_mixed_state_sync`
- `test_empty_tensor_list_sync_state`

which all yield `torch.distributed.elastic.multiprocessing.errors.ChildFailedError` errors that trace back to an error indicating that CPU and MPS tensors are being compared at some point in the testing process. I'm guessing the tests weren't written with the CPU fallback in mind, but I haven't investigated any further.
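For context, a hypothetical one-liner showing the kind of comparison the traceback points at (my own reduction, not taken from the test code):

```python
import torch

a = torch.ones(2)                # CPU tensor
b = torch.ones(2, device="mps")  # MPS tensor
a == b  # raises RuntimeError about tensors being on different devices
```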
2 failures from `tests/metrics/test_toolkit.py`

- `MetricToolkitTest::test_metric_sync`
- `MetricCollectionToolkitTest::test_metric_collection_sync`

both yield the same error message as the 5 tests from `test_synclib.py`.
1 failure from `tests/metrics/image/test_fid.py`

`TestFrechetInceptionDistance::test_fid_random_data_default_model` fails on an `AssertionError`. I haven't made any changes to the `FrechetInceptionDistance` metric and have the `image-requirements.txt` dependencies installed, so was this test case always failing, or have I encountered a platform-specific bug?

Questions
`torchaudio` requirement

I added the new
`audio-requirements.txt` file, in line with the `image-requirements.txt` naming convention, for the `torchaudio` dependency needed by the audio metrics. In general, is there a reliance on these requirements being specified in distinct text files that prevents them from being declared in the `pyproject.toml`?
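For example (a sketch only; the extras names are my own invention, and the contents would mirror the existing requirements files):

```toml
[project.optional-dependencies]
audio = ["torchaudio"]
image = ["scikit-image"]  # plus the rest of image-requirements.txt
```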
`skimage` requirement

scikit-image is now installed with `pip install scikit-image` instead of `pip install skimage`, so installing the image requirements with `pip install -r image-requirements.txt` raises an error: can this dependency be renamed without other conflicts?

Union type
When rebasing my fork, I reverted changes to 3 files that moved from `typing.Union` to the newer `|` union type to annotate function parameters. I noticed a large number of files use the `|` style, but the project README requires "Python >= 3.8" while the `|` type was introduced in 3.10, so maybe bump up the required Python version? (A short illustration of the difference follows.)
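Hypothetical definitions showing the two styles; the second fails at import time on Python < 3.10 unless `from __future__ import annotations` is in effect:

```python
from typing import Union

def update_legacy(value: Union[int, float]) -> None:  # valid on Python 3.8+
    ...

def update_modern(value: int | float) -> None:  # PEP 604, Python 3.10+ only
    ...
```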
Failed to recreate usage example

When fixing docstring typos, I tried to recreate the second usage example for the `WindowedBinaryAUROC` metric.

Expected:

Actual:

I'm not entirely sure why the metric inputs look the way they do anyway, so I'm struggling to follow along with the logic: is this a bug?
MPS-specific bug
My system version of Python is quite new and so not ideal for working on most projects, so I used `uv` to create a virtual environment with a lower version of Python. I made mistakes setting it up properly and was inadvertently working with `torch==2.2.0` (which meets the project requirement of "PyTorch >= 1.11").

When testing the different metrics with the device set to the CPU and then the MPS backend, I encountered a bug with the BLEU score metric. On CPU the example highlighted in the docstring works as expected; however, the same example run on the MPS backend yields an incorrect score.
I tracked the issue down to lines 104-109 in `torcheval/metrics/functional/text/bleu.py`. The MPS implementation of in-place addition was buggy, and the tensors `matches_by_order` and `possible_matches_by_order` would only be modified at their zero-th index and nowhere else. Replacing the in-place additions with a non-in-place workaround fixed the issue.
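A hypothetical minimal repro of the failure mode (my own reduction, not code from the library), as observed with `torch==2.2.0`:

```python
import torch

t = torch.zeros(4, device="mps")
for i in range(4):
    t[i] += 1.0  # in-place indexed addition, as in bleu.py

# Expected tensor([1., 1., 1., 1.]); the buggy MPS build only
# updated index 0, printing tensor([1., 0., 0., 0.], device='mps:0').
print(t)
```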
I didn't think this fix was worth the non-Pythonic code, so I left the in-place addition untouched and upgraded my version of PyTorch. Although I couldn't track down the exact commit that fixed it, the bug appears to have been resolved in `torch==2.4.0` at the earliest. Is it worth adding a platform-specific PyTorch version dependency in case macOS users want to use the MPS backend for the BLEU score metric?
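If so, a PEP 508 environment marker could express it, e.g. in `pyproject.toml` (a sketch; note this would raise the floor for all macOS users, not only those using MPS):

```toml
[project]
dependencies = [
    "torch>=1.11; sys_platform != 'darwin'",
    "torch>=2.4; sys_platform == 'darwin'",
]
```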