Relation filtering for edge methods + sparse metapath counting #12

Merged
14 commits merged into main from filter_relations on Nov 8, 2024

Conversation

AlCatt91
Collaborator

Added the possibility to select the relation types of interest for the edge methods edge_degree_cardinality_summary and edge_pattern_summary. In principle, this reduces the computational requirements somewhat (in particular when computing triangles and metapaths) compared to working on the full dataframe.

I was also thinking that, in principle, it should be possible to compute metapath counts with sparse matmuls as well. The number of triangles is currently computed as (A @ A) * (A > 0), where A is the adjacency matrix with A[h,t] = #edges between h and t. If we instead take the 3D adjacency tensor B, with B[h,r,t] = 1 if there is an edge of type r between h and t, then the number of triangles for a given metapath can be computed as np.einsum("ijk,kxy->ijxy", B, B) * (A[:, None, None, :] > 0), which is a 4D array C with C[h, r1, r2, t] = #triangles of metapath r1-r2 between h and t... however, I don't think scipy.sparse supports 3D sparse matrices or tensor contractions like this.
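
For illustration, a minimal dense NumPy sketch of the idea (toy sizes, not the package code):

import numpy as np

# Toy example: A[h, t] = #edges between h and t; B[h, r, t] = 1 if there is an edge
# of type r between h and t.
rng = np.random.default_rng(0)
n_nodes, n_rels = 5, 3
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(np.int64)
A = B.sum(axis=1)

# Total triangles over each existing edge (current approach).
n_triangles = (A @ A) * (A > 0)

# Per-metapath triangles: C[h, r1, r2, t] = #triangles of metapath r1-r2 between h and t.
C = np.einsum("ijk,kxy->ijxy", B, B) * (A[:, None, None, :] > 0)

# Summing C over the two relation axes recovers the aggregate count.
assert np.array_equal(C.sum(axis=(1, 2)), n_triangles)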

Also fixed the default value of composition_workers, which now takes into account the number of available cores (capped at 32, as that seems to be the optimal ballpark on larger nodes).
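
Roughly speaking, the new default amounts to something like this (the function name here is just illustrative, the actual helper in the package may look different):

import os

def default_composition_workers() -> int:
    # Use all available cores, capped at 32 (the optimal ballpark on larger nodes).
    return min(os.cpu_count() or 1, 32)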

@AlCatt91 AlCatt91 requested review from sbonner0 and danjust October 14, 2024 11:49
@AlCatt91
Collaborator Author

In the end I managed to refactor the metapath composition count with sparse matmuls - it seems to be much more performant, getting rid of the giant pandas merge. This also makes it easy to compute the number of triangles separately for each metapath, which we didn't have before but might be useful (see the new method edge_metapath_count, which is then invoked in edge_pattern_summary).
I realize the code is probably not very easy to parse - flattening the 3D tensors into sparse 2D matrices leads to some indexing acrobatics - but it should be correct. The one thing I don't like (and that is currently quite slow) is materializing the broadcast (n_nodes * n_rels, n_nodes * n_rels) adjacency matrix adj_repeated needed for the masking (i.e., the (A[:, None, None, :] > 0) part)... it should not be necessary in principle, but I haven't found a good way around it yet. I might think a bit more about this.
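
To give an idea of the flattening trick, here is a simplified toy sketch (illustrative only, the actual code in utils.py differs in the details):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_nodes, n_rels = 5, 3
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(np.int64)
A = B.sum(axis=1)  # A[h, t] = #edges between h and t

# Flatten the 3D tensor into two sparse 2D matrices:
#   B_left[h * n_rels + r1, k]  = B[h, r1, k]
#   B_right[k, t * n_rels + r2] = B[k, r2, t]
B_left = sp.csr_matrix(B.reshape(n_nodes * n_rels, n_nodes))
B_right = sp.csr_matrix(B.transpose(0, 2, 1).reshape(n_nodes, n_nodes * n_rels))

# P[h * n_rels + r1, t * n_rels + r2] = #paths h -r1-> k -r2-> t
P = B_left @ B_right

# The masking step: adj_repeated broadcasts (A > 0) to shape
# (n_nodes * n_rels, n_nodes * n_rels), which is the part that is slow to materialize.
adj_repeated = sp.kron(
    sp.csr_matrix(A > 0), np.ones((n_rels, n_rels), dtype=bool), format="csr"
)
metapath_triangles = P.multiply(adj_repeated)

# Sanity check against the aggregate triangle count.
total = metapath_triangles.toarray().reshape(n_nodes, n_rels, n_nodes, n_rels).sum(axis=(1, 3))
assert np.array_equal(total, (A @ A) * (A > 0))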

@sbonner0
Collaborator

Hey @AlCatt91 - thanks so much for taking a look at this!

I had a question about the new edge_metapath_count method: does the n_triangles column count how many instances of that particular metapath exist for the given edge? If so, I think we will need to update the new figure I made to account for this - otherwise the count does not include all of the possible metapaths.

@sbonner0
Collaborator

You also mention this is invoked in the edge_pattern_summary but I don't see the count included there?

@sbonner0
Collaborator

Also as a quick speed test, I am running the following code block:

import pandas as pd
import torch

from kg_topology_toolbox import KGTopologyToolbox  # import path assumed

# triples.pt is assumed to hold an (N, 3) integer tensor of (head, relation, tail) IDs
hetionet_df = pd.DataFrame(
    torch.load("./data/hetionet/triples.pt"), columns=["h", "r", "t"], dtype="int32"
).drop_duplicates()

df = KGTopologyToolbox(hetionet_df).edge_pattern_summary(
    return_metapath_list=True, composition_workers=4
)

On the current version of the package, this is taking 1m 30s - when I switch to this branch, it is taking nearly 4m. Are you seeing similar speeds?

@sbonner0
Collaborator

Also, when I try to run this with PrimeKG, I get the following error: ValueError: negative dimensions are not allowed

@AlCatt91
Collaborator Author

Hi Stephen, thanks for trying this out!

> Hey @AlCatt91 - thanks so much for taking a look at this!
>
> I had a question about the new edge_metapath_count method: does the n_triangles column count how many instances of that particular metapath exist for the given edge? If so, I think we will need to update the new figure I made to account for this - otherwise the count does not include all of the possible metapaths.

In the new method, n_triangles is the count of compositions for the specific metapath (you'll have multiple rows for the same (h,r,t) edge, one for each metapath (r1,r2) that has at least one triangle over that edge). What's wrong with the figure, exactly? Are you not finding the same number of metapaths as before?

> You also mention this is invoked in the edge_pattern_summary but I don't see the count included there?

I was invoking it directly at first, but then realized it was more efficient to just call composition_count (see the change in commit ea845ab). So the output of edge_pattern_summary is exactly the same as before, with one row per edge, because there we aggregate n_triangles by summing over all the possible metapaths. If you want the individual count for each metapath, edge_metapath_count should be used - let me know if you think this should be changed!
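
Schematically, the relation between the two methods is as follows (reusing hetionet_df from your snippet above; the exact column/index names in the outputs may differ from what I write here):

from kg_topology_toolbox import KGTopologyToolbox  # import path assumed

kgtt = KGTopologyToolbox(hetionet_df)
per_metapath = kgtt.edge_metapath_count()   # one row per (h, r, t) edge and metapath (r1, r2)
aggregate = kgtt.edge_pattern_summary()     # one row per (h, r, t) edge

# Summing the per-metapath counts over all metapaths of each edge should give back
# the aggregate n_triangles reported by edge_pattern_summary (column names assumed).
per_edge_total = per_metapath.groupby(["h", "r", "t"])["n_triangles"].sum()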

Quite strange about the speeds you are reporting - in all the tests I ran on different KGs, the new implementation was always faster, both with 4 composition workers and with the default value (32 on the node I'm using). On Hetionet, the speeds I see are quite similar (2m 40s for the new implementation vs 2m 55s for the old one - the 1m 30s you reported seems a bit too fast to me, are you sure about that?), but on graphs with more edges the speedups become more significant (~25% faster on PrimeKG).
Also, I can't reproduce the error on PrimeKG. Could you confirm that you are using the new version of utils.py? If you have the old package installed in your venv and are switching to this branch, it may be necessary to change L15 of topology_toolbox.py from
from kg_topology_toolbox.utils import (
to
from utils import (
otherwise you will still be using the old implementation of composition_count from your installed package.
Are you getting the ValueError right away when running edge_pattern_summary?

@AlCatt91
Collaborator Author

Oh wait, with a clean install of the package I am also getting the ValueError: negative dimensions are not allowed on PrimeKG - I think something might have changed in the dependencies.

@AlCatt91
Collaborator Author

AlCatt91 commented Oct 18, 2024

Based on the discussion, I also tried running the comparison on an arm64 machine on AWS: on Hetionet the new version is still slightly faster than the old one (and both are faster than my previous benchmarks on x86), so I would rule out the need to choose between pandas and scipy.sparse based on architecture... BUT, monitoring memory usage, the new implementation peaks at ~51GB of RAM while the old one peaked at ~17GB, so on machines with less memory it's not surprising that the new one might be slower if it needs to use swap (as in your case, Stephen). I believe this is due to the horrible hack of materializing the broadcast adjacency matrix (as I was remarking here #12 (comment)), which I'm convinced should be avoidable. I'll try to sort this out.

@AlCatt91 AlCatt91 changed the title Add relation filtering for edge methods Relation filtering for edge methods + sparse metapath counting Oct 18, 2024
@AlCatt91
Collaborator Author

I did some refactoring: instead of materializing the (n_nodes * n_rels, n_nodes * n_rels) adjacency matrix, I reduced it to a (n_nodes * n_rels, n_nodes) boolean mask (which is then sliced/wrapped appropriately for each slice of the big matrix). It's likely possible to do even better, but memory usage is already much lighter. On Hetionet, I now see a peak memory usage of ~20GB (compared to ~17GB for the current package) and significant speedups (1m 55s vs the previous 2m 50s). On the larger PrimeKG, the sparse implementation really begins to pay dividends over the pandas merge, with much lower peak usage than the current package (~55GB vs ~87GB) and much faster runtimes (4m 11s vs 9m 7s). @sbonner0 could you please test this new version on your machine when you have time and see how it performs?
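
In case it helps to see the idea, a toy sketch of the slice-wise masking (illustrative only, the actual implementation in utils.py handles the indexing differently):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_nodes, n_rels = 5, 3
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(np.int64)
A = B.sum(axis=1)  # A[h, t] = #edges between h and t

# Flattened left factor, as before: B_left[h * n_rels + r1, k] = B[h, r1, k]
B_left = sp.csr_matrix(B.reshape(n_nodes * n_rels, n_nodes))

# Reduced boolean mask of shape (n_nodes * n_rels, n_nodes):
# mask[h * n_rels + r1, t] = (A[h, t] > 0), i.e. the same row repeated for every r1.
mask = sp.csr_matrix(np.repeat(A > 0, n_rels, axis=0))

# Work on one r2-slice of the product at a time, so the full
# (n_nodes * n_rels, n_nodes * n_rels) matrix is never materialized.
slices = []
for r2 in range(n_rels):
    B_r2 = sp.csr_matrix(B[:, r2, :])              # B_r2[k, t] = B[k, r2, t]
    slices.append((B_left @ B_r2).multiply(mask))  # r1-r2 path counts over existing edges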

@sbonner0
Collaborator

Hey @AlCatt91 - I have tried this on my machine and the results look promising. Hetionet takes roughly the same time on current main and on this branch (~1m 30s), while PrimeKG takes less than half as much time as before -- really nice work!

Collaborator

@sbonner0 sbonner0 left a comment


This all looks great @AlCatt91 - thanks so much for improving the code.

The code is getting quite complex however - I wonder if, once we are happy it is somewhat stable, we should try and comment it a bit more?

@AlCatt91
Collaborator Author

Cheers Stephen for checking and for the feedback! I agree, some more comments would help readability - I'll add a few clarifications.

@AlCatt91
Collaborator Author

I added a few docstrings to improve the documentation, and included the new features in the walkthrough notebook. Let me know if there is anything else that should be addressed before merging :)

Collaborator

@sbonner0 sbonner0 left a comment


I like the expanded doc strings and comments - thanks @AlCatt91 !

Contributor

@danjust danjust left a comment


Thanks Alberto, LGTM - just a very minor change to reduce redundant code. The edge_metapath_count function also introduces some redundancy and maybe confusion but feel free to leave that as it is for now.

Feel free to merge if my commit looks ok to you (and revert that one + merge if not)

@AlCatt91
Collaborator Author

AlCatt91 commented Nov 8, 2024

Thanks, Daniel - I agree with the change you made! I also agree about the code repetition in the new edge_metapath_count method, I realize it's not optimal... For the time being I'll leave it as it is, but I'll consider extracting the common part into a cacheable method in the future.

@AlCatt91 AlCatt91 merged commit acdcf42 into main Nov 8, 2024
1 check passed
@AlCatt91 AlCatt91 deleted the filter_relations branch November 8, 2024 14:59