Platform specific largest floating point type #214

Open · iyassou wants to merge 9 commits into main

Conversation

@iyassou commented Jan 20, 2025

Hello 👋 This library helped me with my uni coursework and I wanted to fix two issues I ran into.

Summary

Certain metrics have the torch.float64 type hardcoded, but not all backends support it, MPS in particular. The device the metric is meant to reside on is now used to determine the largest available floating-point type: either torch.float64 or torch.float32.
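
For reference, a minimal sketch of the idea behind the largest_float helper, assuming the check only needs to special-case MPS (the actual implementation in the PR may differ):

import torch

def largest_float(device: torch.device) -> torch.dtype:
    # Assumption: only MPS lacks float64 support; fall back to float32 there
    # and keep float64 for every other backend.
    return torch.float32 if device.type == "mps" else torch.float64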

Certain metrics use the deprecated Tensor.scatter_(dim, index, src, reduce="add") which I've replaced with the equivalent non-deprecated Tensor.scatter_add_(dim, index, src).
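
As a toy illustration of the equivalence (not taken from the torcheval sources):

>>> import torch
>>> counts = torch.zeros(5)
>>> index = torch.tensor([0, 1, 1, 3])
>>> src = torch.ones(4)
>>> # deprecated: counts.scatter_(0, index, src, reduce="add")
>>> counts.scatter_add_(0, index, src)
tensor([1., 2., 0., 1., 0.])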

I've made small fixes along the way in the form of correcting docstring typos, removing duplicated code, and fixing a minor bug in the generated documentation.

Test plan

Environment:

  • OS: macOS Sonoma 14.6.1
  • Machine: M3 MacBook Pro
  • RAM: 48GB
  • Python: 3.10.16 (within a virtual environment)
  • Environment Variables: PYTORCH_ENABLE_MPS_FALLBACK=1, set after PyTorch recommended it on an initial run without it

The main change is platform-specific and I don't have all the supported backends at my disposal, but I've tested largest_float with torch.device("cpu") and torch.device("mps") on my machine, and torch.device("xla") and torch.device("cuda") in Colab.
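
Since I can't exercise every backend, an empirical probe along these lines could be used to double-check a device I haven't tried (a sketch, not part of the PR):

import torch

def supports_float64(device: torch.device) -> bool:
    # Try to allocate a float64 tensor on the device; backends without
    # float64 support (e.g. MPS) raise an error here.
    try:
        torch.zeros(1, dtype=torch.float64, device=device)
        return True
    except (TypeError, RuntimeError):
        return False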

I've otherwise run the entire test suite on my machine; all but nine tests passed.


6 failures from tests/metrics/test_synclib.py

SynclibTest::test_sync_dtype_and_shape fails because MPS doesn't support torch.float64. The other 5 failing tests are:

  • test_tensor_sync_states
  • test_tensor_list_sync_states
  • test_tensor_dict_sync_states
  • test_complex_mixed_state_sync
  • test_empty_tensor_list_sync_state

which all yield torch.distributed.elastic.multiprocessing.errors.ChildFailedError errors that trace back to the following error:

RuntimeError: ProcessGroupGloo::allgather: invalid tensor type at index 0 (expected TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)), got TensorOptions(dtype=float, device=mps:0, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))

indicating that a CPU tensor and an MPS tensor end up in the same allgather at some point during syncing. I'm guessing the tests weren't written with the CPU fallback in mind, but I haven't investigated further.
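
If the root cause is what it looks like, the workaround would be to gather CPU copies when the process group is gloo; a rough sketch of that idea (hypothetical helper, not the library's actual sync code):

import torch
import torch.distributed as dist

def all_gather_via_cpu(tensor: torch.Tensor) -> list[torch.Tensor]:
    # gloo only handles CPU tensors, so move the MPS tensor off-device first.
    cpu_tensor = tensor.cpu()
    gathered = [torch.empty_like(cpu_tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, cpu_tensor)
    return gathered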

2 failures from tests/metrics/test_toolkit.py

  • MetricToolkitTest::test_metric_sync
  • MetricCollectionToolkitTest::test_metric_collection_sync

both yielding the same error message as the 5 tests from test_synclib.py.

1 failure from tests/metrics/image/test_fid.py

TestFrechetInceptionDistance::test_fid_random_data_default_model fails on an AssertionError:

AssertionError: Scalars are not close!

Expected 4.483039855957031 but got 4.410880088806152.
Absolute difference: 0.0721597671508789 (up to 0.01 allowed)
Relative difference: 0.016096168999031657 (up to 0.01 allowed)

I haven't made any changes to the FrechetInceptionDistance metric and have the image-requirements.txt dependencies installed, so was this test case always failing or have I encountered a platform-specific bug?

Questions

torchaudio requirement

I added the new audio-requirements.txt file, following the image-requirements.txt naming convention, for the torchaudio dependency needed by the audio metrics. In general, is there something that relies on these requirements living in separate text files and prevents them from being specified in pyproject.toml?

skimage requirement

scikit-image is now installed with pip install scikit-image instead of pip install skimage, so installing the image requirements with pip install -r image-requirements.txt raises an error: can this dependency be renamed without causing other conflicts?

Union type

When rebasing my fork I reverted changes to 3 files that had moved from typing.Union to the newer | union syntax for annotating function parameters. I noticed a large number of files use the | style, but the project README requires "Python >= 3.8" while the | syntax was only introduced in Python 3.10 (PEP 604), so maybe bump the required Python version?
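
For reference, the two spellings side by side (hypothetical signatures, purely for illustration); the first runs on Python 3.8, the second needs Python >= 3.10 when evaluated at runtime:

import torch
from typing import Optional, Union

# works on Python >= 3.8
def update_old(input: Union[torch.Tensor, float], weight: Optional[float] = None) -> None: ...

# needs Python >= 3.10 (PEP 604)
def update_new(input: torch.Tensor | float, weight: float | None = None) -> None: ...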

Failed to recreate usage example

When fixing docstring typos I tried to recreate the second usage example for the WindowedBinaryAUROC metric.

Expected:

>>> metric = WindowedBinaryAUROC(max_num_samples=5, num_tasks=2)
>>> metric.update(torch.tensor([[0.2, 0.3], [0.5, 0.1]]), torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
>>> metric.update(torch.tensor([[0.8, 0.3], [0.6, 0.1]]), torch.tensor([[1.0, 1.0], [1.0, 0.0]]))
>>> metric.update(torch.tensor([[0.5, 0.1], [0.3, 0.9]]), torch.tensor([[0.0, 1.0], [0.0, 0.0]]))
>>> metric.inputs
tensor([[0.1000, 0.3000, 0.8000, 0.3000, 0.5000],
        [0.9000, 0.1000, 0.6000, 0.1000, 0.3000]])
>>> metric.targets
tensor([[1., 0., 1., 1., 0.],
        [0., 1., 1., 0., 0.]])
>>> metric.compute()
tensor([0.4167, 0.5000])

Actual:

>>> metric = WindowedBinaryAUROC(max_num_samples=5, num_tasks=2)
>>> metric.update(torch.tensor([[0.2, 0.3], [0.5, 0.1]]), torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
>>> metric.update(torch.tensor([[0.8, 0.3], [0.6, 0.1]]), torch.tensor([[1.0, 1.0], [1.0, 0.0]]))
>>> metric.update(torch.tensor([[0.5, 0.1], [0.3, 0.9]]), torch.tensor([[0.0, 1.0], [0.0, 0.0]]))
>>> metric.inputs
tensor([[0.1000, 0.3000, 0.8000, 0.3000, 0.5000],
        [0.9000, 0.1000, 0.6000, 0.1000, 0.3000]])
>>> metric.targets
tensor([[1., 0., 1., 1., 0.],
        [0., 1., 1., 0., 0.]])
>>> metric.compute()
tensor([0.4167, 0.4167], dtype=torch.float64)

I'm not entirely sure why the metric inputs look the way they do in the first place (they seem to form a circular buffer of size max_num_samples, with the sixth update overwriting index 0), so I'm struggling to follow the logic: is the differing compute() result a bug? A sketch of that buffer behaviour follows.
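
My own reconstruction (not the library code) of the circular-buffer behaviour that would produce the stored inputs above for task 0:

import torch

max_num_samples = 5
buffer = torch.zeros(max_num_samples)
next_idx = 0
for value in [0.2, 0.3, 0.8, 0.3, 0.5, 0.1]:  # task 0's inputs across the three updates
    buffer[next_idx] = value                   # once full, the oldest slot is overwritten
    next_idx = (next_idx + 1) % max_num_samples
print(buffer)  # tensor([0.1000, 0.3000, 0.8000, 0.3000, 0.5000])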

MPS-specific bug

My system Python is quite new and not ideal for working on most projects, so I used uv to create a virtual environment with an older version of Python. I made a mistake setting it up and was inadvertently working with torch==2.2.0 (which still meets the project requirement of "PyTorch >= 1.11").

When testing the different metrics with the device set to the CPU and then the MPS backend, I encountered a bug with the BLEU score metric. On CPU the example highlighted in the docstring works as expected:

>>> import torch
>>> from torcheval.metrics import BLEUScore
>>> metric = BLEUScore(n_gram=4)
>>> candidates = ["the squirrel is eating the nut", "the cat is on the mat"]
>>> references = [["a squirrel is eating a nut", "the squirrel is eating a tasty nut"], ["there is a cat on the mat", "a cat is on the mat"]]
>>> metric.update(candidates, references)
>>> metric.compute()
tensor(0.65341892)

However the same example when run on the MPS backend yields

>>> metric = BLEUScore(n_gram=4, device=torch.device("mps"))
>>> metric.update(candidates, references)                                                                                        
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/iyassou/Desktop/torcheval/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/iyassou/Desktop/torcheval/torcheval/metrics/text/bleu.py", line 100, in update
    ) = _bleu_score_update(input, target, self.n_gram, self.device)
  File "/Users/iyassou/Desktop/torcheval/torcheval/metrics/functional/text/bleu.py", line 112, in _bleu_score_update
    raise ValueError(
ValueError: the input is too short to find all n-gram matches with n_gram=4

I tracked the issue down to lines 104-109 in torcheval/metrics/functional/text/bleu.py:

        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]

        for i in range(n_gram):
            if len_candidate - i > 0:
                possible_matches_by_order[i] += len_candidate - i

The MPS implementation of in-place addition was buggy: the tensors matches_by_order and possible_matches_by_order were only ever modified at index 0 and nowhere else. Replacing the above with:

        for ngram in overlap:
            matches_by_order[len(ngram) - 1] = matches_by_order[len(ngram) - 1] + overlap[ngram]

        for i in range(n_gram):
            if len_candidate - i > 0:
                possible_matches_by_order[i] = possible_matches_by_order[i] + len_candidate - i

fixed the issue:

>>> metric = BLEUScore(n_gram=4, device=torch.device("mps"))
>>> metric.update(candidates, references)
<torcheval.metrics.text.bleu.BLEUScore object at 0x1031fcfa0>
>>> metric.compute()
tensor(0.6534, device='mps:0')

I didn't think this fix was worth the non-Pythonic code, so I left the in-place addition untouched and upgraded my version of PyTorch instead. Although I couldn't track down the exact commit that fixed it, the bug appears to have been resolved by torch==2.4.0 at the earliest. Is it worth adding a platform-specific PyTorch version requirement so that macOS users can use the MPS backend for the BLEU score metric?

PyTorch raises a `DeprecationWarning` when you call `Tensor.scatter_(dim, index, src, reduce="add")`. It's been replaced by the equivalent and non-deprecated call to `Tensor.scatter_add_(dim, index, src)`.
Added docstring links to objects, separate requirements file for developing audio metrics, updated documentation reST files, and fixed some typos.
The empty case now returns a tensor that's on the same device as the metric.
The function to do so was defined but never called.
Staying consistent with the return types expected by the text metrics suite, as seen in `test_word_information_lost.py` and `test_word_information_preserved.py`.
I accepted the wrong changes when rebasing my fork. I've reinstated the old style of typing in the modified functions because the project README requires "Python >= 3.8" and the `|` Union type was introduced in 3.10.