Relation filtering for edge methods + sparse metapath counting #12

Merged
14 commits merged into main from filter_relations on Nov 8, 2024

Conversation

AlCatt91
Collaborator

Added the possibility to select the relation types of interest for the edge methods edge_degree_cardinality_summary and edge_pattern_summary. In principle, this reduces the computational requirements somewhat (in particular when computing triangles and metapaths) compared to working on the full dataframe.

I was also thinking that, in principle, it should be possible to compute metapath counts with sparse matmuls as well. The number of triangles is currently computed as (A @ A) * (A > 0), where A is the adjacency matrix with A[h,t] = #edges between h and t. If we instead take the 3D adjacency tensor B, with B[h,r,t] = 1 if there is an edge of type r between h and t, then the number of triangles for a given metapath can be computed as np.einsum("ijk,kxy->ijxy", B, B) * (A[:, None, None, :] > 0), which is a 4D array C with C[h, r1, r2, t] = #triangles of metapath r1-r2 between h and t... however, I don't think scipy.sparse supports 3D sparse matrices or tensor contractions like this.
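
For illustration, a minimal dense NumPy sketch of the idea (toy sizes, not the package code):

import numpy as np

# Toy example: A[h, t] = #edges between h and t; B[h, r, t] = 1 if there is an edge
# of type r between h and t.
rng = np.random.default_rng(0)
n_nodes, n_rels = 5, 3
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(np.int64)
A = B.sum(axis=1)

# Total triangles over each existing edge (current approach).
n_triangles = (A @ A) * (A > 0)

# Per-metapath triangles: C[h, r1, r2, t] = #triangles of metapath r1-r2 between h and t.
C = np.einsum("ijk,kxy->ijxy", B, B) * (A[:, None, None, :] > 0)

# Summing C over the two relation axes recovers the aggregate count.
assert np.array_equal(C.sum(axis=(1, 2)), n_triangles)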

Also fixed the default value of composition_workers, which now takes into account the number of available cores (capped at 32, as that seems to be the optimal ballpark on larger nodes).
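
Roughly speaking, the new default amounts to something like this (the function name here is just illustrative, the actual helper in the package may look different):

import os

def default_composition_workers() -> int:
    # Use all available cores, capped at 32 (the optimal ballpark on larger nodes).
    return min(os.cpu_count() or 1, 32)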

@AlCatt91 AlCatt91 requested review from sbonner0 and danjust October 14, 2024 11:49
@AlCatt91
Collaborator Author

In the end I managed to refactor the metapath composition count with sparse matmuls - it seems to be much more performant, getting rid of the giant pandas merge. This also makes it easy to compute the number of triangles separately for each metapath, which we didn't have before but might be useful (see the new method edge_metapath_count, which is then invoked in edge_pattern_summary).
I realize the code is probably not very easy to parse - flattening the 3D tensors into sparse 2D matrices leads to some indexing acrobatics - but it should be correct. The one thing I don't like (and that is currently quite slow) is materializing the broadcast (n_nodes * n_rels, n_nodes * n_rels) adjacency matrix adj_repeated needed for the masking (i.e., the (A[:, None, None, :] > 0) part)... it should not be necessary in principle, but I haven't found a good way around it yet. I might think a bit more about this.
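
To give an idea of the flattening trick, here is a simplified toy sketch (illustrative only, the actual code in utils.py differs in the details):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_nodes, n_rels = 5, 3
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(np.int64)
A = B.sum(axis=1)  # A[h, t] = #edges between h and t

# Flatten the 3D tensor into two sparse 2D matrices:
#   B_left[h * n_rels + r1, k]  = B[h, r1, k]
#   B_right[k, t * n_rels + r2] = B[k, r2, t]
B_left = sp.csr_matrix(B.reshape(n_nodes * n_rels, n_nodes))
B_right = sp.csr_matrix(B.transpose(0, 2, 1).reshape(n_nodes, n_nodes * n_rels))

# P[h * n_rels + r1, t * n_rels + r2] = #paths h -r1-> k -r2-> t
P = B_left @ B_right

# The masking step: adj_repeated broadcasts (A > 0) to shape
# (n_nodes * n_rels, n_nodes * n_rels), which is the part that is slow to materialize.
adj_repeated = sp.kron(
    sp.csr_matrix(A > 0), np.ones((n_rels, n_rels), dtype=bool), format="csr"
)
metapath_triangles = P.multiply(adj_repeated)

# Sanity check against the aggregate triangle count.
total = metapath_triangles.toarray().reshape(n_nodes, n_rels, n_nodes, n_rels).sum(axis=(1, 3))
assert np.array_equal(total, (A @ A) * (A > 0))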

@sbonner0
Collaborator

Hey @AlCatt91 - thanks so much for taking a look at this!

I had a question about the new edge_metapath_count method: does the n_triangles column count how many instances of that particular metapath exist for the given edge? If so, I think we will need to update the new figure I made to account for this - otherwise the count does not include all of the possible metapaths.

@sbonner0
Collaborator

You also mention this is invoked in the edge_pattern_summary but I don't see the count included there?

@sbonner0
Collaborator

Also as a quick speed test, I am running the following code block:

import pandas as pd
import torch

from kg_topology_toolbox import KGTopologyToolbox  # import path assumed

# triples.pt is assumed to hold an (N, 3) integer tensor of (head, relation, tail) IDs
hetionet_df = pd.DataFrame(
    torch.load("./data/hetionet/triples.pt"), columns=["h", "r", "t"], dtype="int32"
).drop_duplicates()

df = KGTopologyToolbox(hetionet_df).edge_pattern_summary(
    return_metapath_list=True, composition_workers=4
)

On the current version of the package, this is taking 1m 30s - when I switch to this branch, it is taking nearly 4m. Are you seeing similar speeds?

@sbonner0
Collaborator

Also, when I try to run this with PrimeKG, I get the following error: ValueError: negative dimensions are not allowed

@AlCatt91
Collaborator Author

Hi Stephen, thanks for trying this out!

> Hey @AlCatt91 - thanks so much for taking a look at this!
>
> I had a question about the new edge_metapath_count method: does the n_triangles column count how many instances of that particular metapath exist for the given edge? If so, I think we will need to update the new figure I made to account for this - otherwise the count does not include all of the possible metapaths.

In the new method, n_triangles is the count of compositions for the specific metapath (you'll have multiple rows for the same (h,r,t) edge, one for each metapath (r1,r2) that has at least one triangle over that edge). What's wrong with the figure, exactly? Are you not finding the same number of metapaths as before?

> You also mention this is invoked in the edge_pattern_summary but I don't see the count included there?

I was invoking it directly at first, but then realized it was more efficient to just call composition_count (see the change in commit ea845ab). So the output of edge_pattern_summary is exactly the same as before, with one row per edge, because there we aggregate n_triangles by summing over all the possible metapaths. If you want the individual count for each metapath, edge_metapath_count should be used - let me know if you think this should be changed!
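
Schematically, the relation between the two methods is as follows (reusing hetionet_df from your snippet above; the exact column/index names in the outputs may differ from what I write here):

from kg_topology_toolbox import KGTopologyToolbox  # import path assumed

kgtt = KGTopologyToolbox(hetionet_df)
per_metapath = kgtt.edge_metapath_count()   # one row per (h, r, t) edge and metapath (r1, r2)
aggregate = kgtt.edge_pattern_summary()     # one row per (h, r, t) edge

# Summing the per-metapath counts over all metapaths of each edge should give back
# the aggregate n_triangles reported by edge_pattern_summary (column names assumed).
per_edge_total = per_metapath.groupby(["h", "r", "t"])["n_triangles"].sum()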

Quite strange about the speeds you are reporting - in all the tests I ran on different KGs, the new implementation was always faster, both with 4 composition workers and with the default value (32 on the node I'm using). On Hetionet, the speeds I see are quite similar (2m 40s for the new implementation vs 2m 55s for the old one - the 1m 30s you reported seems a bit too fast to me, are you sure about that?), but on graphs with more edges the speedups become more significant (~25% faster on PrimeKG).
Also, I can't reproduce the error on PrimeKG. Could you confirm that you are using the new version of utils.py? If you have the old package installed in your venv and are switching to this branch, it may be necessary to change L15 of topology_toolbox.py from
from kg_topology_toolbox.utils import (
to
from utils import (
otherwise you will still be using the old implementation of composition_count from your installed package.
Are you getting the ValueError right away when running edge_pattern_summary?

@AlCatt91
Collaborator Author

Oh wait, with a clean install of the package I am also getting the ValueError: negative dimensions are not allowed on PrimeKG - I think something might have changed in the dependencies.

@AlCatt91
Collaborator Author

AlCatt91 commented Oct 18, 2024

Based on the discussion, I also tried running the comparison on an arm64 machine on AWS: on Hetionet the new version is still slightly faster than the old one (and both are faster than my previous benchmarks on x86), so I would rule out the need to choose between pandas and scipy.sparse based on architecture... BUT, monitoring memory usage, the new implementation peaks at ~51GB of RAM while the old one peaked at ~17GB, so on machines with less memory it's not surprising that the new one might be slower if it needs to use swap (as in your case, Stephen). I believe this is due to the horrible hack of materializing the broadcast adjacency matrix (as I was remarking here #12 (comment)), which I'm convinced should be avoidable. I'll try to sort this out.

@AlCatt91 AlCatt91 changed the title Add relation filtering for edge methods Relation filtering for edge methods + sparse metapath counting Oct 18, 2024
@AlCatt91
Collaborator Author

I did some refactoring: instead of materializing the (n_nodes * n_rels, n_nodes * n_rels) adjacency matrix, I reduced it to a (n_nodes * n_rels, n_nodes) boolean mask (which is then sliced/wrapped appropriately for each slice of the big matrix). It's likely possible to do even better, but memory usage is already much lighter. On Hetionet, I now see a peak memory usage of ~20GB (compared to ~17GB for the current package) and significant speedups (1m 55s vs the previous 2m 50s). On the larger PrimeKG, the sparse implementation really begins to pay dividends over the pandas merge, with much lower peak usage than the current package (~55GB vs ~87GB) and much faster runtimes (4m 11s vs 9m 7s). @sbonner0 could you please test this new version on your machine when you have time and see how it performs?
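
In case it helps to see the idea, a toy sketch of the slice-wise masking (illustrative only, the actual implementation in utils.py handles the indexing differently):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_nodes, n_rels = 5, 3
B = (rng.random((n_nodes, n_rels, n_nodes)) < 0.3).astype(np.int64)
A = B.sum(axis=1)  # A[h, t] = #edges between h and t

# Flattened left factor, as before: B_left[h * n_rels + r1, k] = B[h, r1, k]
B_left = sp.csr_matrix(B.reshape(n_nodes * n_rels, n_nodes))

# Reduced boolean mask of shape (n_nodes * n_rels, n_nodes):
# mask[h * n_rels + r1, t] = (A[h, t] > 0), i.e. the same row repeated for every r1.
mask = sp.csr_matrix(np.repeat(A > 0, n_rels, axis=0))

# Work on one r2-slice of the product at a time, so the full
# (n_nodes * n_rels, n_nodes * n_rels) matrix is never materialized.
slices = []
for r2 in range(n_rels):
    B_r2 = sp.csr_matrix(B[:, r2, :])              # B_r2[k, t] = B[k, r2, t]
    slices.append((B_left @ B_r2).multiply(mask))  # r1-r2 path counts over existing edges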

@sbonner0
Collaborator

Hey @AlCatt91 - I have tried this on my machine and the results look promising. Hetionet takes roughly the same time on current main and on this branch (~1m 30s), while PrimeKG takes less than half as much time as before -- really nice work!

Collaborator

@sbonner0 sbonner0 left a comment


This all looks great @AlCatt91 - thanks so much for improving the code.

The code is getting quite complex however - I wonder if, once we are happy it is somewhat stable, we should try and comment it a bit more?

@AlCatt91
Collaborator Author

Cheers Stephen for checking and for the feedback! I agree, some more comments would help readability - I'll add a few clarifications.

@AlCatt91
Collaborator Author

I added a few docstrings to improve the documentation, and included the new features in the walkthrough notebook. Let me know if there is anything else that should be addressed before merging :)

Collaborator

@sbonner0 sbonner0 left a comment


I like the expanded doc strings and comments - thanks @AlCatt91 !

Contributor

@danjust danjust left a comment


Thanks Alberto, LGTM - just a very minor change to reduce redundant code. The edge_metapath_count function also introduces some redundancy and maybe confusion but feel free to leave that as it is for now.

Feel free to merge if my commit looks ok to you (and revert that one + merge if not)

@AlCatt91
Collaborator Author

AlCatt91 commented Nov 8, 2024

Thanks, Daniel - I agree with the change you made! I also agree about the code repetition in the new edge_metapath_count method, I realize it's not optimal... For the time being I'll leave it as it is, but I'll consider extracting the common part into a cacheable method in the future.

@AlCatt91 AlCatt91 merged commit acdcf42 into main Nov 8, 2024
1 check passed
@AlCatt91 AlCatt91 deleted the filter_relations branch November 8, 2024 14:59