
WIP - run 'runtime-tests' on Android #746

Draft
wants to merge 3 commits into main

Conversation

andreaTP
Collaborator

Opening for visibility, @yigit and @evacchi.

I'm not done with this:

  • I had headaches trying to get a correct classpath for WASI (Wat2Wasm), so I gave up for now and disabled those tests; @yigit, help here would be appreciated 🙏
  • I keep hitting an OOM that causes a failure on the skip-stack-guard-page test (it checks for call-stack exhaustion)
  • The tests now take considerable time to run; I'd like to consider running only a linter on PRs and the full tests as a nightly job

Thoughts?

@yigit
Contributor

yigit commented Jan 22, 2025

Have you tried running this locally?

./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest#skip-stack-guard-page.wast

I'm wondering if we are leaking some memory during the tests, or whether it is just this test itself.

@evacchi
Collaborator

evacchi commented Jan 22, 2025

Just curious: I was expecting the time to increase a lot; how long is it taking now? Anyway, +1 on running nightly and/or on-demand.

@andreaTP
Collaborator Author

Have you tried running this locally?

Yes, same result, but thanks for the command line; it's helpful for continuing to debug.

I was expecting the time to increase a lot

It takes approx 10 minutes to get to the end of the tests.

@evacchi
Collaborator

evacchi commented Jan 22, 2025

Not too bad after all :) but it still probably makes sense to make it optional/nightly for now

@yigit
Contributor

yigit commented Jan 23, 2025

I can take a look at the test time eventually.
If we can make the Maven task produce reproducible outputs (e.g. no snapshots and no timestamps in jars), we should be able to set up the Gradle cache to avoid unnecessary runs.

We can also shard and/or run tests in parallel.
But until then, nightly seems fine.

There is always the option of using Firebase Test Lab to further speed them up. It would not cost much, but that is a consideration for later. Just mentioning it for reference: we have options if we need these tests to run fast.

@evacchi
Collaborator

evacchi commented Jan 23, 2025

./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest#skip-stack-guard-page.wast

yes! the test actually passes with:

./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest

so we pair-debugged this a little, and it looks like the main issue is memory pressure, because a lot of tests run in a short amount of time. We tried adding a Runtime.getRuntime().gc() in a strategic place (TestModule), but the issue is probably scoped to that particular test (so far), because it intentionally grows the virtual stack of the interpreter, causing a lot of garbage to be retained.

so, either we find a way to set up the VM so that this is not an issue (we did try to change the vm.heap flag on the VM, but that does not seem to really affect the outcome), or we find a way to isolate the units (maybe each in its own VM, or a group of them) so that we don't run into the issue.

finally, one last resort would be to add a stack count in the interpreter, and flag that specific test so that we put a limit on it.

in the meantime we could also just skip that one test
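One way to picture that "stack count in the interpreter" idea is a call-depth guard that trips a cheap, trace-free exception well before the host stack (or heap) gives out. This is only a sketch under assumed names (`CallDepthGuard` and `StackExhaustedException` are made up here, not Chicory's actual API):

```java
import java.util.function.Supplier;

// Hypothetical sketch: a per-instance call-depth counter the interpreter could
// consult on every function entry, instead of relying on the host JVM stack limit.
public class CallDepthGuard {
    private final int depthLimit; // conservative bound, could be lowered for flagged tests
    private int depth = 0;

    public CallDepthGuard(int depthLimit) { this.depthLimit = depthLimit; }

    // Wrap every interpreted call; throw a small, controlled exception
    // instead of letting the host JVM blow its own stack.
    public <T> T enter(Supplier<T> body) {
        if (++depth > depthLimit) {
            depth--;
            throw new StackExhaustedException("call stack exhausted at depth " + depthLimit);
        }
        try {
            return body.get();
        } finally {
            depth--;
        }
    }

    public static class StackExhaustedException extends RuntimeException {
        public StackExhaustedException(String msg) {
            // Disable stack-trace capture: materializing a huge trace is exactly
            // what was OOMing the emulator in the skip-stack-guard-page test.
            super(msg, null, false, false);
        }
    }
}
```

The four-argument `Throwable` constructor with `writableStackTrace=false` keeps the exception cheap to throw at any depth.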

@evacchi
Collaborator

evacchi commented Jan 23, 2025

FYI on my machine (M3 Pro) it takes 6m42s to run the full suite excluding SpecV1SkipStackGuardPageTest:

        testInstrumentationRunnerArguments["notClass"] = "com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest"

for some reason the task still fails even though the suite ran successfully:

> Task :device-tests:connectedRuntimeTestsDebugAndroidTest FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':device-tests:connectedRuntimeTestsDebugAndroidTest'.
> There were failing tests. See the report at: file:///Users/evacchi/Devel/java/chicory/android-tests/device-tests/build/reports/androidTests/connected/debug/flavors/runtimeTests/index.html

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --debug option to get more log output.
> Run with --scan to get full insights.
> Get more help at https://help.gradle.org.

BUILD FAILED in 6m 42s
45 actionable tasks: 2 executed, 43 up-to-date
Watched directory hierarchies: [/Users/evacchi/Devel/java/chicory/android-tests]
(Screenshot of the test results, 2025-01-23, attached.)
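For context, the `notClass` exclusion used above goes through AGP's standard instrumentation-argument mechanism; a sketch of where it lives (the module path and surrounding block are assumed, not copied from the project):

```kotlin
// android-tests/device-tests/build.gradle.kts (path assumed)
android {
    defaultConfig {
        // AndroidJUnitRunner's "notClass" argument excludes a class from the run;
        // "class" (used earlier in this thread) selects a single one instead.
        testInstrumentationRunnerArguments["notClass"] =
            "com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest"
    }
}
```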

@andreaTP
Collaborator Author

With the updated commits, especially after android:largeHeap=true, I have the full runtime-tests running to completion successfully.

Now the next issue is that, as @evacchi noticed too, the Android SDK disconnects/crashes when trying to send back the test results (it's reproducible in CI).

@yigit any clue? Can you try to take a look?
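For reference, the `android:largeHeap=true` switch mentioned above is a one-line manifest attribute (the manifest below is a minimal sketch, not the project's actual file):

```xml
<!-- AndroidManifest.xml: android:largeHeap asks the runtime for an enlarged Dalvik
     heap; it is a hint, and the granted size still varies per device/emulator profile. -->
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
    <application android:largeHeap="true" />
</manifest>
```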

@andreaTP
Collaborator Author

@yigit I forgot to answer:

If we can make the maven task produce reproducible outputs (e.g. not snapshots and no timestamps in jars)

This is not going to happen unless we have a real-world use case for it.
It's extremely hard to achieve, and caching is not highly valuable given the minimal number of dependencies of this project.

But until then, nightly seems fine.

I'll integrate the changes when we finalize this PR.

There is always the option to use firebase test lab to further speed them up.

Let's get to a working and more complete setup, and we iterate as we see the need.

@yigit
Contributor

yigit commented Jan 24, 2025

This is the first time I've seen that issue, but it looks like it was already reported:

https://issuetracker.google.com/issues/330756856

For the short term, I think sharding is the easiest solution we have (assuming the JUnit 5 runner doesn't have an issue with it).

Hopefully this weekend I'll have some time to play with this.
Given that this test runs by itself but not with others, I'm curious if there might be a leak somewhere.

but probably the issue is scoped to that particular test (so far) because it intentionally grows the virtual stack of the interpreter, causing a lot of garbage to be retained.

Is there a limit on the virtual stack, or is it basically the limit of the host machine? If it is the limit of the host, yes, we'll likely hit it no matter what. If it is bounded by the host process, it might make sense to limit it to (max memory minus some buffer, like a MB). But I'm guessing the same thing would be a problem on the JVM. Maybe the overhead of the test runner is bigger, as it has to keep everything in memory until reporting (about which I'm surprised, but also not that surprised :) )

@yigit
Contributor

yigit commented Jan 24, 2025

I've created 2 variants with sharding.

  • One of them just shards,
  • The other one moves the Maven build to a separate step so others can re-use its output. This would reduce GitHub resource usage but would likely increase CI times.

We also don't have to use a matrix build and can simply invoke Gradle multiple times in the same build to run shard by shard. We might need some massaging to prevent results from overwriting each other. We may also not want to shard small flavors (e.g. run connectedRuntimeDebugAndroidTest and then connectedRuntimeTestsDebugAndroidTest with shards).
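Shard-by-shard invocation in a single CI job could look like the sketch below. `numShards` and `shardIndex` are AndroidJUnitRunner's real sharding arguments; the loop, the report path, and the renaming step are assumptions about this project's layout:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run the suite in 4 shards, one Gradle invocation each, renaming the report
# directory after every pass so shards don't overwrite each other's results.
NUM_SHARDS=4
for i in $(seq 0 $((NUM_SHARDS - 1))); do
  ./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest \
    -Pandroid.testInstrumentationRunnerArguments.numShards="$NUM_SHARDS" \
    -Pandroid.testInstrumentationRunnerArguments.shardIndex="$i"
  mv device-tests/build/reports/androidTests \
     "device-tests/build/reports/androidTests-shard-$i"
done
```

Note that the runner's default sharding splits at the method level, which matters for the ordering discussion later in this thread.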

The other 2 things in those branches are:

  • zip the results: GH upload does not support : in file names, so we need to zip them before calling the GitHub upload action.
  • Gradle scans: I've enabled Gradle scans. Gradle doesn't really understand Android tests, but it will give some overall visibility into what is happening in the Gradle build. It's just good to have for reference and diagnosis in the future.

It is failing with

java.lang.NullPointerException: Attempt to invoke virtual method 'com.dylibso.chicory.runtime.Instance$Exports com.dylibso.chicory.runtime.Instance.exports()' on a null object reference
at com.dylibso.chicory.test.gen.SpecV1AddressTest.test6(SpecV1AddressTest.java:105)
at java.lang.reflect.Method.invoke(Native Method)
at java.util.ArrayList.forEach(ArrayList.java:1262)
at java.util.ArrayList.forEach(ArrayList.java:1262)

like errors, but I didn't get a chance to diagnose them.

@andreaTP
Collaborator Author

@yigit thanks a lot for getting back and all the information!

Is there a limit on the virtual stack or is it basically the limit of the host machine.

I did the research (specs/proposals, etc.) and haven't found any good reference; it looks like everyone relies on the underlying runtime limit.
The interesting story is that the emulator was throwing this error:

java.lang.OutOfMemoryError: OutOfMemoryError thrown while trying to throw an exception; no stack trace available

already reported here.
Basically everything is working as expected, but, given that the StackOverflowError contains a huge stack trace, the emulator was crashing with an OOM while trying to fill it in 😅

i think sharding is the easiest solution

I'm fine with it if it's reproducible locally (and we can bundle the command to execute everything locally in a bash script).

Trying out the commands in your branch: sharding tries to execute single test methods on different shards, and this is not going to work (that's the root cause of the NPE you noticed).
We discovered early on that WebAssembly wast tests (for example) are not self-contained; they depend on the ordered execution of all the methods in each class (that's why we heavily depend on the Ordered annotation).
Basically, we should ensure that sharding keeps classes, not methods, as the minimal split unit.
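To make the ordering constraint concrete, here is a self-contained sketch (made-up names, not the generated tests) of why a shard that picks up individual methods breaks: later methods read state that only earlier ones create, so they must run together and in `@Order` sequence. The sketch emulates JUnit's ordering with plain reflection:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical mini-suite showing why method-level sharding breaks wast-derived tests.
public class OrderedSuiteSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Order { int value(); }

    public static class WastSuite {
        Object instance; // stands in for the instantiated wasm module

        @Order(1) public void instantiate() { instance = new Object(); }

        @Order(2) public void invokeExport() {
            // NPEs if instantiate() did not run first - exactly the failure
            // seen when the runner shards by method.
            instance.toString();
        }
    }

    // Runs all @Order methods in annotation order; returns the names that ran.
    public static List<String> runAll() {
        List<Method> methods = new ArrayList<>();
        for (Method m : WastSuite.class.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Order.class)) methods.add(m);
        }
        methods.sort(Comparator.comparingInt(m -> m.getAnnotation(Order.class).value()));
        WastSuite suite = new WastSuite();
        List<String> ran = new ArrayList<>();
        for (Method m : methods) {
            try {
                m.invoke(suite);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
            ran.add(m.getName());
        }
        return ran;
    }

    public static void main(String[] args) {
        System.out.println(runAll()); // prints [instantiate, invokeExport]
    }
}
```

A shard containing only `invokeExport()` would throw the same kind of NPE seen in SpecV1AddressTest above.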

Regarding "zipping the results" and "Gradle scans": I truly appreciate your enthusiasm and your drive to make things better. Thank you!

That said, it’s important to keep the scope in mind. The Gradle build we use for running tests on Android serves as a basic sanity check. The simpler it is, the better.

More specifically:

  • Test results: These tests are focused solely on ensuring compatibility with Android. I don’t anticipate gaining deep insights from the results. In the case of a failure, the primary need is simply the ability to reproduce it locally.
  • Gradle scans: My goal is for us to have a setup that is stable and rarely needs updating or fixing. Once we achieve that, debugging the setup should ideally not be necessary.

Thank you again for your efforts; I really value your commitment and thoughtfulness!

@andreaTP
Collaborator Author

Forgot to follow up regarding the build matrix:
I think that we can keep a single execution testing only the stricter requirement, @yigit do you see any issue with this approach?

@yigit
Contributor

yigit commented Jan 24, 2025

I'm afraid the Android test runner does not support sharding by class 😞.
We could consider doing it in Gradle. Since we already have a jar, it wouldn't be hard to partition it by class into multiple jars (and multiple flavors). Obviously, that would mean more complex build logic in Gradle, but I'm not sure how easy that is in Maven. Maybe you can create a Maven task to do that?

I think that we can keep a single execution testing only the stricter requirement, @yigit do you see any issue with this approach?

Not sure if I understand. You mean invoking Gradle multiple times, once per shard? We can definitely extract this to a shell script or a single Gradle task (which might be nicer for local analysis, as Studio would discover it as well, so one could debug a specific failure). But I'm guessing a shell script will be easier for contributors.

Also, about the test results/scans:
I think test results are important, as they immediately tell you what failed in a nice HTML page. I can remove the XML reports, though (I think those are the ones creating the problem).

Scans are good if we ever want to optimize this, but they don't really matter (it also costs nothing to publish them for open-source projects).

But we can remove either of these (though I would recommend keeping test results).

@yigit
Contributor

yigit commented Jan 25, 2025

I played a bit more with this, sharding by class, etc.
But SpecV1MemoryCopyTest alone is big enough that it hits the gRPC limit by itself. The limit might be per class; I've not checked what is really going on there.

Maybe that can be avoided by changing codegen to divide tests but based on the order problem you've mentioned, I'm not even sure if that is feasible.

I think the only viable option might be to generate fat tests for Android, e.g. JavaTestGen could generate tests that look like this when invoked for Android:

class GeneratedTest {
    // @Test // remove test annotations from the test methods
    // @Order(1)
    // @DisplayName("memory_copy.wast:15")
    public void test1() {
        var varTest = testModule0Instance.exports().function("test");
        var exception = assertDoesNotThrow(() -> varTest.apply());
    }
    ...
    @Test // have one test that calls all the others in the right order
    @DisplayName("run all tests in wasmFileName")
    public void runAllTests() {
        // maybe call the instantiations here, or keep relying on order for them
        test1();
        test2();
        test3();
    }
}

Not sure if that is something you would be interested in doing, but I'm a bit out of options here :'(. Sorry, I didn't expect Android support to get so complicated :/.

@evacchi
Collaborator

evacchi commented Jan 25, 2025

no worries, all your suggestions are very welcome :D

keep a single execution testing

I think what @andreaTP meant is "a single command to run the entire suite", but let's hear from him

I think the only viable option might be to generate fat tests for android. e.g. JavaTestGen could generate tests that look like this, when invoked for android.

I don't know what you guys think of this approach, but if we know which tests cause the problem (and I believe there will be a handful), maybe we could just exclude them from the main run and then create a task that runs those separately?

e.g.

testAll() {
   runFirstBatch()
   runSecondBatch()
   // if necessary, more batches
}

we could also "tag" the tests with an annotation (e.g. @LargeTest), if the number of methods is above some threshold
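AndroidJUnitRunner can already filter on arbitrary runtime-retained annotations through its `annotation`/`notAnnotation` instrumentation arguments, so the tagging could be a plain custom annotation. A sketch (the `LargeSuite` name and the example class below are made up, not part of the project):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker the code generator could attach to classes whose method
// count exceeds some threshold; must be RUNTIME-retained so the test runner
// can see it via reflection when filtering.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface LargeSuite {}

// Example of a generated class the codegen might tag (name made up):
@LargeSuite
class SpecV1MemoryCopyTestExample {}
```

A run excluding the tagged classes would then pass something like `-Pandroid.testInstrumentationRunnerArguments.notAnnotation=<fully.qualified.LargeSuite>` (the package is a placeholder), and a second run with `annotation=` would pick up only the large suites.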

@andreaTP
Collaborator Author

Quickly answering from the phone.

You mean invoking Gradle multiple times per shard?

No, I mean removing the matrix for the Android SDK and testing only on API 33.

Re: test results, I'm happy to keep whatever comes for free; e.g. a step in CI is fine.
If it requires work or a complicated setup, we should try to avoid it.

Re: scans, no strong opinions; the Gradle build is a side build for this project, and I'm not sure how I would act on feedback from the scans.

is big enough that it hits the grpc limit by itself.

This is really worrying!
Some SIMD tests are generating more than 3 MB of source code... we are going to hit the limit anyhow.
Do you know where to open a relevant issue about this bug?
The gRPC limit should be tweakable with properties somewhere...

Maybe you can create a maven task to do that ?

The current test gen is already able to split the generation based on configuration, but I don't think we should invest in a complex setup there.

All in all, I think the priority here is to open a bug report and try to overcome the gRPC limit/client disconnection (this is the biggest blocker).

Second, we should start considering running a few smoke tests instead of the full TCK.
I'm thinking of something like machine-test running a real-world wasm workload, using the interpreter and the AOT at compile time, including WASI, etc.

@andreaTP
Collaborator Author

Sorry, i didn't expect android support to get so complicated :/.

@yigit I was honestly expecting some hiccups :-) and I'm grateful to have someone knowledgeable like you assisting in the process. Thanks a ton and no worries!

@yigit
Contributor

yigit commented Jan 25, 2025

I noticed that the bug somehow didn't get triaged and was reported in the wrong component. I moved it to the Android Gradle Plugin component and will ping someone during the week.

I'm afraid this is the code that creates that grpc client:

And it doesn't receive any configuration parameters.
I'll check with the Gradle team later in the week; meanwhile, I like the annotation solution from @evacchi. FYI, we don't need to use the Android annotations (e.g. LargeTest); we can make up an annotation and filter tests out while running them (or "shard" runs using annotations).

Are the machine-tests more end-to-end tests? If so, they are definitely worth prioritizing.

@andreaTP
Collaborator Author

Separating the tests into groups with annotations is definitely possible and easy to implement 👍. As soon as we verify that it can work, I volunteer to make the changes (e.g. @Group1, @Group2, etc.).

machine-tests run QuickJS (compiling and running a simple JS example) with WASI, mixing "Machines".
For Android we should be able to check the compile-time AOT and the interpreter.
The idea is: if a JS compiler and runtime work, we have some confidence that things will work on Android, and we can treat the rest as bugs, increasing coverage when needed.

@andreaTP
Collaborator Author

And it doesn't receive any configuration parameters.

Usually it is possible to use env vars and global properties anyhow; I'll check on Monday.

andreaTP added a commit that referenced this pull request Jan 30, 2025
Investigating #746 I noticed that the resizes of the `ArrayList` were
consuming a lot of memory.
@andreaTP
Collaborator Author

@yigit no pressure and no timelines, just curious if you managed to keep the ball rolling on this subject and have any kind of news 🙂.

I'd be happy to help, but this is all very new to me, and I'm unsure where to start, for example, to try to tweak the gRPC limit. Any link/information will be much appreciated!

@yigit
Contributor

yigit commented Feb 1, 2025

Hey, sorry, I forgot to update here, but I did re-route it internally. The team is looking into it, but having it fixed and then shipped in a version that we can use is going to take a long time :'(

And there doesn't seem to be a workaround.

I think our 2 options here are:

  • don't run big tests
  • create fat tests that run all of these one by one, without them being individual tests.

If (2) is not too complex a code-generator change, I think it is the better option, so we get coverage without waiting for another AGP release. I can give it a try if you are open to that solution.

@andreaTP
Collaborator Author

andreaTP commented Feb 1, 2025

Thanks a lot for getting back @yigit !

I can give it a try if you are open to that solution.

Given the past conversation, where you mentioned that even some single test classes exceed the limits, I'm not sure how it will work.
That said, if you have something in mind, please go ahead with a poc and we iterate on it, I have no specific concerns with the approach.

In my mind, after #748, you should be able to avoid generating the test jars and directly invoke the codegen from Gradle to materialize the test classes on the fly.
In this way you have more granular control over what gets executed.
Again, this is merely "one way" of doing things, and I'm happy to support any viable alternative.

If everything fails for running the testsuite, we can always fall back to smoke testing; in that case, the next step would be to include the manually written WASI tests, as they run "hello world" samples from some popular languages.

@yigit
Contributor

yigit commented Feb 1, 2025

Given the past conversation where you mentioned that even some single test classes are exceeding the limits I'm not sure how it will work. That said, if you have something in mind, please go ahead with a poc and we iterate on it, I have no specific concerns with the approach.
Oh, I think that was a misunderstanding. Just running one test class where each test case is a separate @Test doesn't work (it hits the limit). If we merge them into one @Test, it should work (I assume; I haven't tried).

In my mind, after #748 , you should be able to avoid generating the test jars, and directly invoke the codegen from Gradle to materialize test classes on the fly. In this way you have more granular control over what will execute. Again, this is merely "one way" of doing things, and I'm happy to support any viable alternative.
I'll take a look. So this is basically importing the test-gen library from Gradle (like a regular dependency for the build) and then invoking it directly with the proper source sets?

@andreaTP
Collaborator Author

andreaTP commented Feb 1, 2025

If we merge them into 1 test, it should work (i assume, haven't tried).

This should be doable, but it's a bit of work... I'd try to avoid it if possible.

Re Gradle, I was thinking about doing something like this:

fabric8io/kubernetes-client#4829 (comment)

3 participants