
Speed-up samplers by avoiding backwards seeks #245

Merged — 29 commits into main, Oct 7, 2024
Conversation

@NicolasHug (Member) commented Oct 7, 2024

We now sort and dedup the frame indices to be decoded within the sampler (see code for details). We could still re-implement this in C++ to optimize it further, but this alone already leads to strong speed-ups.
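The strategy can be sketched as follows (a minimal illustration of the idea, not the actual torchcodec implementation; all names here are hypothetical):

```python
# Sketch of the sort-and-dedup strategy: decode each unique frame index once,
# in increasing order (so the decoder never seeks backwards), then scatter the
# decoded frames back to their originally requested positions.

def decode_sorted_dedup(decode_frame, indices):
    # Visit unique indices in increasing order so decoding is forward-only.
    unique_sorted = sorted(set(indices))
    cache = {i: decode_frame(i) for i in unique_sorted}
    # Restore the caller's original ordering; duplicates share one decode.
    return [cache[i] for i in indices]

# Toy "decoder" that records the order in which frames are decoded.
decode_order = []
def fake_decode(i):
    decode_order.append(i)
    return f"frame{i}"

frames = decode_sorted_dedup(fake_decode, [7, 2, 7, 5])
# decode_order is [2, 5, 7]: forward-only, and frame 7 is decoded only once.
```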

Benchmark results - TL;DR: 5X faster when num_clips is large.

Using the following values:
sampler = clips_at_random_indices
num_frames_per_clip = 10
num_indices_between_frames = 2


num_clips = 1
When num_clips=1 there should be no speed-up; we just need to make sure the new logic didn't add any overhead.
main: med = 92.16ms +- 9.36
PR:   med = 89.27ms +- 7.90


num_clips = 50
With num_clips = 50 there is potentially a lot of overlap and many backwards seeks, so we expect significant speed-ups.
main: med = 1527.26ms +- 839.48
PR:   med = 331.37ms +- 170.58

Benchmark code:

from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
import torch
from time import perf_counter_ns

def bench(f, *args, num_exp=100, warmup=0, **kwargs):

    for _ in range(warmup):
        f(*args, **kwargs)

    times = []
    for _ in range(num_exp):
        start = perf_counter_ns()
        f(*args, **kwargs)
        end = perf_counter_ns()
        times.append(end - start)
    return torch.tensor(times).float()

def report_stats(times, unit="ms"):
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{med = :.2f}{unit} +- {std:.2f}")
    return med


def sample():
    decoder = VideoDecoder("test/resources/nasa_13013.mp4")
    clips_at_random_indices(
        decoder,
        num_clips=1,  # varied between runs (1 and 50 in the results above)
        num_frames_per_clip=10,
        num_indices_between_frames=2,
    )

times = bench(sample, num_exp=30, warmup=2)
report_stats(times, unit="ms")

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 7, 2024
@NicolasHug NicolasHug marked this pull request as ready for review October 7, 2024 16:15
decoded_frame = decoder.get_frame_at(index=frame_index)
previous_decoded_frame = decoded_frame
all_decoded_frames[j] = decoded_frame

all_clips: list[list[Frame]] = chunk_list(
    all_decoded_frames, chunk_size=num_frames_per_clip
)
@NicolasHug (Member, Author) commented:
Note that we don't have to chunk the clips. The implementation already allows us to return a single 5D FrameBatch instead of a list[4D FrameBatch]. I'll just leave this for another PR so we can discuss.
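For reference, the chunking helper referred to above can be sketched like this (a hypothetical re-implementation of a `chunk_list`-style function, not torchcodec's actual code):

```python
def chunk_list(lst, chunk_size):
    # Split a flat list of decoded frames into consecutive fixed-size clips.
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
clips = chunk_list(frames, chunk_size=3)
# Two clips of 3 frames each, in decode order.
```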

and frame_index == all_clips_indices_sorted[i - 1]
):
# Avoid decoding the same frame twice.
decoded_frame = previous_decoded_frame
Contributor commented:
This is setting it to the same python object, right?

Will there be any issues with that? Example, if the user modifies that tensor or something else in FrameBatch -- they will modify both entries in the list, right?
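The aliasing concern can be demonstrated with plain tensors (illustrative only; this is not torchcodec code):

```python
import torch

frame = torch.zeros(3)
# Reusing the same Python object for two entries means both alias one tensor.
frames = [frame, frame]
frames[0] += 1  # in-place modification through the first entry
# frames[1] sees the change too, because both entries are the same object.
```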

decoder.get_frame_at(index) for index in all_clips_indices
]
all_clips_indices_sorted, argsort = zip(
*sorted((j, i) for (i, j) in enumerate(all_clips_indices))
Contributor commented:
Nit: i, j makes both look like indices in the same range. Maybe call them batch_index, frame_index?
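The `zip(*sorted(...))` idiom in the snippet above is a common pure-Python argsort: sorting (value, position) pairs yields both the sorted values and the permutation needed to scatter results back. A standalone illustration:

```python
all_clips_indices = [7, 2, 7, 5]

# Sort the frame indices while remembering each one's original position.
indices_sorted, argsort = zip(
    *sorted((value, position) for (position, value) in enumerate(all_clips_indices))
)
# indices_sorted == (2, 5, 7, 7); argsort == (1, 3, 0, 2)

# Decode in sorted (forward-only) order, then restore the original order.
decoded_sorted = [f"frame{i}" for i in indices_sorted]
restored = [None] * len(all_clips_indices)
for sorted_pos, original_pos in enumerate(argsort):
    restored[original_pos] = decoded_sorted[sorted_pos]
```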

and frame_index == all_clips_indices_sorted[i - 1]
):
# Avoid decoding the same frame twice.
decoded_frame = previous_decoded_frame
@NicolasHug (Member, Author) commented:
To be slightly safer w.r.t. future changes this should be

decoded_frame = copy(previous_decoded_frame)

but we don't implement __copy__ on Frame.

Note that a copy still happens within to_framebatch, so this is currently safe, but admittedly subject to an implementation detail that will change.

We can either:

  • be OK with this since we'll re-implement it in C++ anyway
  • implement __copy__.

LMK.
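If the second option were chosen, `__copy__` could look roughly like this (a hypothetical sketch; the field names on this stand-in `Frame` are assumptions, not torchcodec's actual definition):

```python
from copy import copy
from dataclasses import dataclass

import torch

# Hypothetical stand-in for torchcodec's Frame; field names are assumptions.
@dataclass
class Frame:
    data: torch.Tensor
    pts_seconds: float
    duration_seconds: float

    def __copy__(self):
        # Clone the tensor so the copy can be mutated independently.
        return Frame(self.data.clone(), self.pts_seconds, self.duration_seconds)

original = Frame(torch.zeros(3), pts_seconds=0.0, duration_seconds=0.04)
duplicate = copy(original)
duplicate.data += 1
# original.data is unchanged: the two frames no longer share storage.
```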

Contributor replied:
I am OK with this as-is.

@NicolasHug (Member, Author) replied:
OK, I'll add a comment in to_framebatch() so we don't accidentally mess it up.

@ahmadsharif1 (Contributor) commented:

Optionally, you could also check in the benchmark code, because we probably want to track its performance and make sure it doesn't regress.

@NicolasHug (Member, Author) replied:

Sounds good, let me do that in another PR. I tried doing it here, but it created a lot of undesirable changes since we already have a benchmark_samplers.py file, which I think we should remove.

@NicolasHug NicolasHug merged commit b65882e into main Oct 7, 2024
22 checks passed
@NicolasHug NicolasHug deleted the samplers_fast branch October 7, 2024 17:25
@NicolasHug NicolasHug mentioned this pull request Oct 8, 2024