Skip to content
This repository has been archived by the owner on Jun 17, 2024. It is now read-only.

window based operators #2

Open
TonyAbell opened this issue Dec 5, 2013 · 2 comments
Open

window based operators #2

TonyAbell opened this issue Dec 5, 2013 · 2 comments

Comments

@TonyAbell
Copy link

I am looking to do Tumbling Windows similar to what String Insight dose.

http://blogs.msdn.com/b/streaminsight/archive/2010/11/23/windows-part-1-the-basics.aspx

I do not currently see the operator in Naiad source.

Is it planned?

If not can you give guidance on how best to build it.

@mrry
Copy link
Contributor

mrry commented Dec 5, 2013

Hi Tony,

It should certainly be possible to build a tumbling window operator in Naiad, and there are many possible approaches, depending on the framework you use to write the rest of your program, and the window policy that you want to enact.

The basis of a tumbling window is a (custom) vertex that rewrites the timestamps of incoming messages (records) so that members of the same window have the same timestamp. Depending on the policy required, this vertex might gather all of the incoming messages in a single location to compute some predicate over them, it might be a binary vertex with a control input that determines when a new window should start, or it might even be part of a data-parallel stage wrapped in a feedback loop (so that parallel vertices could exchange data (e.g. counts)).

Once you produce a stream where the epoch number identifies each window, the existing frameworks can be used to process it. To use differential dataflow, you can Select each record of type R to make it a Weighted with weight 1, and pass this through a SlidingWindow(1) operator (which ensures that the stateful differential dataflow operators consider each window in isolation). You can also use the non-incremental Lindi framework straight out of the box, box it applies the operator logic independently on records with different timestamps.

Which of these cases best matches your problem?

@TonyAbell
Copy link
Author

I am looking to rolling aggregate counts.

Using twitter stream as an example, I would like to get the top N or all, '@mention' or '#hash' for a given time period minute, hour, day, week.

Is there documentation on the differences / characteristics between the Lindi vs. Differential Data Flow Frameworks?

From your explanation, I think the differential Data Flow would be most appropriate.

From what I understood in your comment I would need to derive a custom vertex from here:

https://github.com/MicrosoftResearchSVC/Naiad/blob/release_0.2/Naiad/Dataflow/Vertex.cs

Would I create an internal timer that would reset the windows, and rewrite the epoch's accordingly?

While using SlidingWindow(1) operator it, would only output results for the given epoch.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants