Relation filtering for edge methods + sparse metapath counting #12
Conversation
I managed in the end to refactor the metapath composition count with sparse matmuls - it seems to be much more performant, getting rid of the giant pandas merge. This also allows us to easily compute the number of triangles for each metapath separately, which we didn't have before but might be useful (see the new edge_metapath_count method).
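(For illustration, a minimal sketch of the sparse-matmul idea described above, counting triangles of a single metapath (r1, r2) from per-relation scipy.sparse adjacency matrices. This is not the package's actual code; the toy data and all names are assumptions.)

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triples (head, relation, tail) - illustrative data only.
h = np.array([0, 1, 1, 0, 2])
r = np.array([0, 1, 0, 1, 1])
t = np.array([1, 2, 2, 2, 0])
n_nodes = 3

def rel_adj(k: int) -> csr_matrix:
    """Adjacency matrix of relation k: entry [i, j] = #edges i -k-> j."""
    m = r == k
    return csr_matrix((np.ones(m.sum()), (h[m], t[m])), shape=(n_nodes, n_nodes))

# Adjacency over all relations, used to mask directly connected (h, t) pairs.
A = csr_matrix((np.ones_like(h), (h, t)), shape=(n_nodes, n_nodes))

# Triangles of metapath r1 -> r2: 2-hop path counts h -r1-> x -r2-> t,
# kept only where a direct edge h -> t also exists.
r1, r2 = 0, 1
tri = (rel_adj(r1) @ rel_adj(r2)).multiply(A > 0)
print(tri.toarray())
```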
Hey @AlCatt91 - thanks so much for taking a look at this! I had a question about the new edge_metapath_count method.
You also mention this is invoked in the edge_pattern_summary method.
Also as a quick speed test, I am running the following code block:

```python
import pandas as pd
import torch

# Import path assumed from the package name; adjust if it differs.
from kg_topology_toolbox import KGTopologyToolbox

hetionet_df = pd.DataFrame(
    torch.load("./data/hetionet/triples.pt"), columns=["h", "r", "t"], dtype="int32"
).drop_duplicates()
df = KGTopologyToolbox(hetionet_df).edge_pattern_summary(
    return_metapath_list=True, composition_workers=4
)
```

On the current version of the package, this takes 1m 30s - when I switch to this branch, it takes nearly 4m. Are you seeing similar speeds?
Also, when I try to run this with PrimeKG, I get an error.
Hi Stephen, thanks for trying this out!
In the new method, I was invoking the computation directly, but then realized it was more efficient to just call edge_metapath_count.
Quite strange, the speeds you are reporting - in all the tests I ran on different KGs, the new implementation was always faster, both with 4 composition workers and with the default value (32 on the node I'm using). On Hetionet, the timings I see are quite similar (2m 40s for the new implementation, 2m 55s for the old one - the 1m 30s you reported seems a bit too fast to me, are you sure about that?), but on graphs with more edges the speedups become more significant (~25% faster on PrimeKG).
Oh wait, with a clean install of the package I am also getting the same error.
Based on the discussion, I tried running the comparison also on an arm64 machine on AWS, and on Hetionet the new version is still slightly faster than the old one (and both are faster than in my previous benchmarks on x86) -- so I would rule out the need to choose between pandas and scipy.sparse based on architecture... BUT, monitoring memory usage, the new implementation peaks at ~51GB of RAM, while the old one peaked at ~17GB, so on machines with less memory it's not surprising that the new one might be slower if it needs to use swap (as in your case, Stephen). I believe this is due to the horrible hack of materializing the broadcasted adjacency matrix (as I was remarking here: #12 (comment)), which I'm convinced should be avoidable. I'll try to sort this out.
I did some refactoring: instead of materializing the (n_nodes * n_rels, n_nodes * n_rels) adjacency matrix, I reduced it to a (n_nodes * n_rels, n_nodes) boolean mask (which is then sliced/wrapped appropriately for each slice of the big matrix). It's likely possible to do even better, but memory usage is already much lighter. On Hetionet, I am now seeing a peak memory usage of ~20GB (compared to ~17GB for the current package), and significant speedups (1m 55s vs the previous 2m 50s). On the larger PrimeKG, the sparse implementation begins to pay dividends vs the pandas merge, with much lower peak usage than the current package (~55GB vs ~87GB) and much faster runtimes (4m 11s vs 9m 07s). @sbonner0 could you please test this new version on your machine when you have time, and see how it performs?
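(A rough sketch, under my own naming assumptions, of the stacked-matrix approach described above: the per-relation adjacencies are stacked into a single (n_nodes * n_rels, n_nodes) sparse matrix, and the 4D count tensor is produced one tail-relation slice at a time instead of being materialized in full. Not the package's actual implementation.)

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def metapath_triangle_counts(h, r, t, n_nodes, n_rels):
    """h, r, t: integer numpy arrays of head, relation, tail indices."""
    ones = np.ones_like(h)
    # Per-relation adjacencies: B[k][i, j] = #edges i -k-> j.
    B = [
        csr_matrix((ones[r == k], (h[r == k], t[r == k])), shape=(n_nodes, n_nodes))
        for k in range(n_rels)
    ]
    # Stack into one (n_nodes * n_rels, n_nodes) matrix: row block k is B[k].
    B_stack = vstack(B, format="csr")
    # Boolean mask of directly connected (h, t) pairs, tiled over row blocks.
    mask = csr_matrix((ones, (h, t)), shape=(n_nodes, n_nodes)).astype(bool)
    tiled_mask = vstack([mask] * n_rels, format="csr")
    counts = {}
    for r2 in range(n_rels):
        # One slice of the 4D tensor: row (r1 * n_nodes + h), column t holds
        # the #paths h -r1-> x -r2-> t, masked to direct h -> t edges.
        slice_r2 = (B_stack @ B[r2]).multiply(tiled_mask).tocsr()
        for r1 in range(n_rels):
            tri = slice_r2[r1 * n_nodes : (r1 + 1) * n_nodes]
            if tri.nnz:
                counts[(r1, r2)] = tri
    return counts
```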
Hey @AlCatt91 - I have tried this on my machine and the results look promising. Hetionet takes roughly the same time on the current main and this branch (~1m 30s), but PrimeKG takes less than half as much time as before -- really nice work!
This all looks great @AlCatt91 - thanks so much for improving the code.
The code is getting quite complex, however - I wonder whether, once we are happy it is somewhat stable, we should try to comment it a bit more?
Cheers Stephen for checking and for the feedback! I agree, some more comments would help readability - I'll add a few clarifications.
I added a few docstrings to improve documentation - and included the new features in the walkthrough notebook. Let me know if there is anything else that should be addressed before merging :)
I like the expanded docstrings and comments - thanks @AlCatt91!
Thanks Alberto, LGTM - just a very minor change to reduce redundant code. The edge_metapath_count function also introduces some redundancy and maybe some confusion, but feel free to leave that as it is for now.
Feel free to merge if my commit looks OK to you (and revert that one + merge if not).
Thanks, Daniel - agree on the change you made! Also agree on the code repetition in the new edge_metapath_count method.
Added the possibility to select the relation types you are interested in for the edge methods edge_degree_cardinality_summary and edge_pattern_summary. This allows, in principle, to reduce the computational requirements a bit (in particular, when computing triangles and metapaths) compared to working on the full dataframe.

I was also thinking that, in principle, it should be possible to compute metapaths using sparse matmuls as well. The number of triangles is computed with (A @ A) * (A > 0), with A the adjacency matrix defined by A[h, t] = #edges between h and t. If we take instead the 3D adjacency matrix B, with B[h, r, t] = 1 if there is an edge of type r between h and t, then the number of triangles of a given metapath can be computed with np.einsum("ijk,kxy->ijxy", B, B) * (A[:, None, None, :] > 0), which is a 4D matrix C with C[h, r1, r2, t] = #triangles of metapath r1-r2 between h and t... however, I don't think that scipy.sparse supports sparse 3D matrices and tensor contractions like this.

Also fixed the default value of composition_workers, which now takes into consideration the number of cores available (capped at 32, as that seems to be the optimal ballpark on larger nodes).
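(As a sanity check of the 4D formula above: a toy dense-numpy verification that the einsum expression reproduces per-relation-pair matmul counts. Illustrative only; the random data and variable names are mine.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_rels = 6, 3
# B[h, r, t] = 1 if there is an edge of type r between h and t.
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(int)
A = B.sum(axis=1)  # A[h, t] = #edges between h and t

# 4D count tensor: C[h, r1, r2, t] = #triangles of metapath r1-r2 between h, t.
C = np.einsum("ijk,kxy->ijxy", B, B) * (A[:, None, None, :] > 0)

# Cross-check against the 2D formula applied per relation pair.
for r1 in range(n_rels):
    for r2 in range(n_rels):
        expected = (B[:, r1, :] @ B[:, r2, :]) * (A > 0)
        assert np.array_equal(C[:, r1, r2, :], expected)
print("einsum formula matches per-pair matmuls")
```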