
Samplers: add support for edge-case policies #241

Merged

NicolasHug merged 20 commits into main from samplers_edge_case on Oct 7, 2024
Conversation

Member

@NicolasHug NicolasHug commented Oct 4, 2024

This PR addresses the first big item in #239, namely: supporting a "policy" to apply when the sampler tries to sample a frame beyond the end of the video. We don't automatically error anymore, which gives more flexibility. As discussed offline, the available policies are "repeat_last", "wrap", and "error".

Important points:

  • I factorized the samplers into this _abstract_sampler() helper function. The only difference between the two samplers is the call to torch.randint vs torch.linspace, and I wanted to make that clear.
  • The different policies are implemented as well-defined Python functions (see the sketch after this note). This leaves the door open for us to let users define their own policies by providing a function, should we want that in the future.
  • The crux of this PR is the _build_all_clips_indices() function. It's simple, but it does a lot of things. Look at it carefully.
  • The other important part is the _decode_all_clips_indices() function, which we plan to re-write in a much more efficient way, with a C++ core.

Note: This PR is already complex enough, so I have not implemented the "try to reduce num_indices_between_frames" strategy that I originally suggested in #221 (comment). It is orthogonal to the policies implemented here, so we can revisit it later if we want.
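For concreteness, here is a minimal sketch of what the three policies could look like, reusing the (frames_indices, num_frames_per_clip) signature visible in the _error_policy() fragment quoted later in this thread. The bodies are plausible illustrations, not necessarily the PR's exact code.

```python
# Sketch only: the signature follows the _error_policy() fragment quoted
# below; the bodies are plausible implementations, not the PR's exact code.
def _repeat_last_policy(frames_indices: list[int], num_frames_per_clip: int) -> list[int]:
    # Pad the clip with copies of its last valid frame index.
    frames_indices += [frames_indices[-1]] * (num_frames_per_clip - len(frames_indices))
    return frames_indices


def _wrap_policy(frames_indices: list[int], num_frames_per_clip: int) -> list[int]:
    # Wrap around: cycle through the clip's own indices until the clip is full.
    repetitions = num_frames_per_clip // len(frames_indices) + 1
    return (frames_indices * repetitions)[:num_frames_per_clip]


def _error_policy(frames_indices: list[int], num_frames_per_clip: int) -> list[int]:
    raise ValueError("Sampler tried to decode a frame beyond the end of the video.")
```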

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 4, 2024
@NicolasHug NicolasHug marked this pull request as ready for review October 7, 2024 10:41
Comment on lines +191 to +192
# Important note: sampling_range_end defines the upper bound of where a clip
# can *start*, not where a clip can end.
Member Author

@NicolasHug NicolasHug Oct 7, 2024

This is fairly important, and something we haven't made explicit yet in our discussions. The reason it has to be defined as the upper bound of where a clip can start is that if sampling_range_end defined the upper bound of where a clip can end, then the last few frames in that sampling range would always have a lower probability of being sampled, i.e. we would always be in the scenario that #221 (comment) pointed out.

If sampling_range_end defines the upper bound of where a clip can end, then technically there really is no need for a policy. Hope that makes sense.

Contributor

@scotts scotts Oct 7, 2024

Almost entirely agreed, but the wrinkle is that even if we defined sampling_range_end to be the upper bound of where the last clip can end, we'd still need a policy unless we also defined len(decoder) as the default last possible frame for the end of a clip. If we did that for both, we'd also ensure only whole clips, and then we would not need a policy.

Agreed that this is the right approach. We'll need to document it carefully. :) I also think there's no way around careful documentation: whatever behavior we choose will have subtleties.

Contributor

@ahmadsharif1 ahmadsharif1 Oct 7, 2024

I was trying to prove to myself that the comment about probabilities made sense, so I wrote a simple simulator:

[image: simulated per-frame selection frequencies]

It seems like the distribution is symmetric, in the sense that even the first few frames have a smaller chance of being selected.

I think that is because when you pick frame X you pick frame X + 1 as well, but you never pick frame -1, for example. So frames at the start also have a lower chance. Note that this is with 0 dilation, so that may change things too.

I don't know if it breaks your assumptions, but it's worth thinking about.
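For reference, here is a minimal simulator along the lines described above. It is a hypothetical re-creation, not the actual script behind the screenshot: it draws uniform random clip starts (with clips required to fit entirely in the video) and counts how often each frame index is selected.

```python
import random
from collections import Counter

# Hypothetical re-creation of the simulator described above (not the actual
# script): draw random clip starts, count how often each frame is selected.
num_frames, clip_size, num_trials = 100, 5, 100_000
counts = Counter()
for _ in range(num_trials):
    start = random.randrange(num_frames - clip_size + 1)  # clip fits entirely
    counts.update(range(start, start + clip_size))

# Both boundaries are under-represented: frame 0 is picked only when start == 0,
# whereas a middle frame is covered by clip_size different start positions.
print([counts[i] for i in range(8)])        # ramps up from the left edge
print([counts[i] for i in range(92, 100)])  # ramps down at the right edge
```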

Contributor

It may make sense to have a "slow" test with a simulator to see if the fair policy is within some threshold of an "even" distribution.

This comment isn't blocking or anything. It was mostly for my own curiosity, since we don't have a closed-form mathematical expression for the probability of a frame being selected.

Member Author

Ah, it's true that the first frames are also less likely to be sampled. Oh boy. I don't think we can or should do anything about that side though?

Contributor

You could make your code symmetric by picking those "negative frames" and then repeating them in the other direction (i.e. repeating the first frame) with your policy code.

Again, not blocking, but worth considering if you want to make it symmetric and fair.

Contributor

@ahmadsharif1 ahmadsharif1 Oct 7, 2024

It's kind of like how Conv1d adds padding to both sides:

https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html

You would have to do something symmetric, like repeating frames on both the left and right boundaries, to make it truly fair. Not sure if it's worth the effort.

Member Author

@NicolasHug NicolasHug Oct 7, 2024

> It may make sense to have a "slow" test with a simulator to see if the fair policy is within some threshold of an "even" distribution.

I did write a test to illustrate the difference in behavior depending on the value of sampling_range_end.

Writing an actual statistical test is challenging because, when sampling_range_end=len(decoder), the last frames tend to be over-represented (because they are repeated). I.e. the frames still don't follow a uniform distribution, so there's no known distribution to check against.

Contributor

Thinking about it a bit more: if you want all frames to be totally equal in terms of output probability, you have to pick the first and last few frames with higher probability. It's not as simple as repeating the last frame, because the second-to-last frame also needs to be weighted higher, and so on.

And that's why your test doesn't follow a uniform distribution.

For now this implementation is easy and simple. Later on if there is a need, we can implement something complex and "fair".

Member Author

Yeah, enforcing a uniform distribution across all frames is pretty hard.
That being said, we can already enforce a uniform distribution for clip starts: all the user has to do is set sampling_range_end=len(decoder). With that, all frames have an equal probability of being picked to start a clip.
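A hypothetical usage sketch of that setting follows. The sampler name and sampling_range_end come from this PR; the num_clips and policy keyword names are assumed for illustration, not taken from the diff.

```python
# Hypothetical usage sketch -- keyword names other than num_frames_per_clip
# and sampling_range_end are assumptions, not taken from the PR.
clips = clips_at_random_indices(
    decoder,
    num_clips=16,
    num_frames_per_clip=4,
    sampling_range_end=len(decoder),  # every frame can start a clip
    policy="repeat_last",             # pad clips that run past the end
)
```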


    return [to_framebatch(clip) for clip in all_clips]


Contributor

@scotts scotts Oct 7, 2024

Nit: I think a better name is _generic_sampler(). I tend to think of things called "abstract" as defining desired behavior without an actual implementation, and of things called "generic" as having an implementation that is generic over one or several parameters.

if sampling_range_end is None:
-    sampling_range_end = num_frames - clip_span + 1
+    sampling_range_end = max(num_frames_in_video - clip_span + 1, 1)
Contributor

Based on the discussion below, shouldn't this just be num_frames_in_video? That is, the default last index at which a clip can start is the final frame in the video. And if we're worried that num_frames_in_video is not at least 1, maybe we should explicitly check for (and error on?) that.

Even if we decide line 56 is correct, I can't reconcile it with line 67. There, it looks like we're making sure that sampling_range_end is not larger than the video itself, which makes sense to me. But there we don't subtract off the size of a clip, while here at line 56 we do.

Member Author

@NicolasHug NicolasHug Oct 7, 2024

I think the lack of documentation is making this significantly harder to review, sorry about that.

So, my reasoning is the following:

  • What happens at line 56 is the default behavior. At line 67, this isn't the default behavior anymore: sampling_range_end is passed by the user.
  • The default behavior (line 56) is: "automatically set the upper bound of where a clip can start to a value such that we never try to sample beyond the last frame." That's why we have to deduct clip_span from num_frames_in_video.
  • The reason this is the default behavior is that there is no way for users to explicitly set such a value, unless they do the math of computing clip_span themselves (and we don't want them to do that, I think?).
  • Now, at line 67, when the user explicitly passes sampling_range_end, we just make sure it doesn't go above num_frames_in_video, to avoid errors on short videos. If we were to subtract clip_span there, it would change the semantics of sampling_range_end from "upper bound of the start" to "upper bound of the end".

Basically:

  • The default behavior means that the last few frames are less likely to be sampled and the policy doesn't come into play, so there are no "degenerate" clips (degenerate = the policy had to be applied).
  • If the user explicitly passes a value, then the probability of sampling the last few frames may increase (depending on the value), and the policies become relevant. Clips may be degenerate. A worked example follows below.
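To make the two cases concrete, here is a small worked example with illustrative numbers (not from the PR), using the _get_clip_span() formula quoted later in this conversation:

```python
# Worked example (illustrative numbers) of line 56 vs. line 67, using the
# _get_clip_span() formula quoted later in this conversation.
num_frames_in_video = 10
num_frames_per_clip = 4
num_indices_between_frames = 1
clip_span = num_indices_between_frames * (num_frames_per_clip - 1) + 1  # == 4

# Default (line 56): the last allowed start is index 6, so the last clip is
# [6, 7, 8, 9] and never overruns the video -- the policy never applies.
default_end = max(num_frames_in_video - clip_span + 1, 1)  # == 7

# User passes sampling_range_end=10 (line 67): a clip may start at index 9,
# needing frames [9, 10, 11, 12] -- the policy fills in the three missing ones.
```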

Contributor

@scotts scotts Oct 7, 2024

Yes, that all tracks. I think that is consistent with what I surmised below, when I read the test. I do have a slight preference for what I explained, but it's only a slight preference and not blocking.

Member Author

SG, thanks. I don't have a super strong preference either. Let me merge now to unblock, but happy to revisit.

Comment on lines +166 to +168
# TODO:
# - sort the frames to avoid backward seeks, dedup, decode, and re-organize frames.
# - write most of this in C++
Member Author

@NicolasHug NicolasHug Oct 7, 2024

If you want a peek at what this will look like: 3dcbe1e (#245)

I'm leaving it for a follow-up PR to ease reviewing. A basic benchmark shows a 5X speedup.
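For the curious, here is a rough sketch of the strategy the TODO describes. It assumes the decoder exposes a get_frame_at(index) accessor (an assumption, not confirmed by this PR); the real version is planned as a C++ rewrite.

```python
# Rough sketch of the TODO above (assumes the decoder exposes a
# get_frame_at(index) accessor): decode each unique frame exactly once, in
# sorted order so the decoder never has to seek backwards, then rebuild clips.
def _decode_all_clips_indices(decoder, all_clips_indices):
    unique_sorted = sorted({i for clip in all_clips_indices for i in clip})
    decoded = {i: decoder.get_frame_at(i) for i in unique_sorted}
    return [[decoded[i] for i in clip] for clip in all_clips_indices]
```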


def _error_policy(frames_indices: list[int], num_frames_per_clip: int) -> list[int]:
    raise ValueError(
        "You set the 'error' policy, and the sampler tried the decode a frame "
Contributor

"tried the decode" -> "tried to decode"

@@ -74,130 +88,211 @@ def _get_clip_span(*, num_indices_between_frames, num_frames_per_clip):
     return num_indices_between_frames * (num_frames_per_clip - 1) + 1


-def clips_at_random_indices(
+def _repeat_last_policy(
Contributor

I ❤️ the elegance of these policies.

        num_indices_between_frames=num_indices_between_frames,
        num_frames_per_clip=num_frames_per_clip,
    )

Contributor

The fact that this loop is so easy to implement (we can just call Python's range() directly) and that how we're using our policy is both obvious and still generic makes me more confident we're doing the right thing.
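A hypothetical reconstruction of the loop in question, to show how range() and the policy slot together (not the PR's exact code):

```python
# Hypothetical reconstruction of the loop being discussed (not the PR's exact
# code): build each clip's indices with range(), and let the policy fill in
# whatever falls past the end of the video.
def _build_all_clips_indices(
    clip_start_indices, num_frames_per_clip, num_indices_between_frames,
    num_frames_in_video, policy_fun,
):
    all_clips_indices = []
    for start in clip_start_indices:
        upper_bound = min(
            start + num_indices_between_frames * num_frames_per_clip,
            num_frames_in_video,
        )
        frames_indices = list(range(start, upper_bound, num_indices_between_frames))
        if len(frames_indices) < num_frames_per_clip:
            frames_indices = policy_fun(frames_indices, num_frames_per_clip)
        all_clips_indices.append(frames_indices)
    return all_clips_indices
```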

@@ -131,6 +134,80 @@ def test_sampling_range_negative(sampler):
    assert_tensor_equal(clip.data, clips_1[0].data)


@pytest.mark.parametrize("sampler", (clips_at_random_indices, clips_at_regular_indices))
def test_sampling_range_default_behavior(sampler):
Contributor

Okay, based on this comment, I think that explains the behavior I was confused about in my comment above.

This means that effectively, the default values for the sampler are:

  sampling_range_start = 0,
  sampling_range_end = max(len(decoder) - clip_span + 1, 1),

I do think it may be simpler to just make them:

  sampling_range_start = 0,
  sampling_range_end = len(decoder),

And then let the policy deal with the case where we start a clip in the range [len(decoder) - clip_span, len(decoder)). (I may be off by 1 somewhere.) I don't have a strong preference for this, but I think it will make our code more straightforward, and I think it's easier to explain the behavior. I'm unsure if it's actually a better result for the user, though.

# on the default behavior.
# Note: ideally we would directly assert the number of occurrences of the
# last frame based on its index, but our APIs don't return indices, only pts
# (yet?)
Contributor

Yes, we should do that. Created #246.

@NicolasHug NicolasHug merged commit 5a69001 into main Oct 7, 2024
17 checks passed
@NicolasHug NicolasHug deleted the samplers_edge_case branch October 7, 2024 16:11