
WIP - run 'runtime-tests' on Android #746

Draft
wants to merge 3 commits into main

Conversation

andreaTP
Collaborator

Opening for visibility, @yigit and @evacchi.

I'm not done with this:

  • I had headaches trying to get a correct classpath for WASI (Wat2Wasm), so I gave up for now and disabled those tests; @yigit, help here would be appreciated 🙏
  • I keep hitting an OOM that causes a failure on the skip-stack-guard-page test (it checks for call-stack exhaustion)
  • The tests now take considerable time to run; I'd like to consider running only a linter on PRs and the full tests as a nightly job

Thoughts?

@yigit
Contributor

yigit commented Jan 22, 2025

Have you tried running this locally?

./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest#skip-stack-guard-page.wast

I'm wondering if we are leaking some memory during the tests, or whether it is just this test itself.

@evacchi
Collaborator

evacchi commented Jan 22, 2025

Just curious: I was expecting the time to increase a lot; how long is it taking now? Anyway, +1 on running nightly and/or on-demand.

@andreaTP
Collaborator Author

Have you tried running this locally?

Yes, same result, but thanks for the command line; it's helpful for continuing to debug.

I was expecting the time to increase a lot

It takes approx 10 minutes to get to the end of the tests.

@evacchi
Collaborator

evacchi commented Jan 22, 2025

Not too bad after all :) but it still probably makes sense to make it optional/nightly for now

@yigit
Contributor

yigit commented Jan 23, 2025

I can take a look at the test time eventually.
If we can make the Maven task produce reproducible outputs (e.g. no snapshots and no timestamps in jars), we should be able to set up the Gradle cache to avoid unnecessary runs.

We can also shard and/or run tests in parallel.
But until then, nightly seems fine.

There is always the option of using Firebase Test Lab to further speed them up. It would not cost much, but that is a consideration for later. Just mentioning it for reference: we have options if we need these tests to run fast.

@evacchi
Collaborator

evacchi commented Jan 23, 2025

./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest#skip-stack-guard-page.wast

yes! the test actually passes with:

./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest

so we pair-debugged this a little, and it looks like the main issue is memory pressure, because a lot of tests run in a short amount of time. We tried adding a Runtime.getRuntime().gc() in a strategic place (TestModule), but the issue is probably scoped to that particular test (so far), because it intentionally grows the virtual stack of the interpreter, causing a lot of garbage to be retained.

so, either we find a way to set up the VM so that this is not an issue (we did try to change the vm.heap flag on the VM, but that does not seem to really affect the outcome), or we find a way to isolate the units (maybe each in its own VM, or a group of them) so that we don't run into the issue.

finally, one last resort would be to add a stack count in the interpreter, and flag that specific test so that we put a limit on it.

in the meantime we could also just skip that one test
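One way to picture that "stack count in the interpreter" idea is a call-depth guard that trips a cheap, trace-free exception well before the host stack (or heap) gives out. This is only a sketch under assumed names (`CallDepthGuard` and `StackExhaustedException` are made up here, not Chicory's actual API):

```java
import java.util.function.Supplier;

// Hypothetical sketch: a per-instance call-depth counter the interpreter could
// consult on every function entry, instead of relying on the host JVM stack limit.
public class CallDepthGuard {
    private final int depthLimit; // conservative bound, could be lowered for flagged tests
    private int depth = 0;

    public CallDepthGuard(int depthLimit) { this.depthLimit = depthLimit; }

    // Wrap every interpreted call; throw a small, controlled exception
    // instead of letting the host JVM blow its own stack.
    public <T> T enter(Supplier<T> body) {
        if (++depth > depthLimit) {
            depth--;
            throw new StackExhaustedException("call stack exhausted at depth " + depthLimit);
        }
        try {
            return body.get();
        } finally {
            depth--;
        }
    }

    public static class StackExhaustedException extends RuntimeException {
        public StackExhaustedException(String msg) {
            // Disable stack-trace capture: materializing a huge trace is exactly
            // what was OOMing the emulator in the skip-stack-guard-page test.
            super(msg, null, false, false);
        }
    }
}
```

The four-argument `Throwable` constructor with `writableStackTrace=false` keeps the exception cheap to throw at any depth.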

@evacchi
Collaborator

evacchi commented Jan 23, 2025

FYI on my machine (M3 Pro) it takes 6m42s to run the full suite excluding SpecV1SkipStackGuardPageTest:

        testInstrumentationRunnerArguments["notClass"] = "com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest"

for some reason the task still fails even though the suite ran successfully:

> Task :device-tests:connectedRuntimeTestsDebugAndroidTest FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':device-tests:connectedRuntimeTestsDebugAndroidTest'.
> There were failing tests. See the report at: file:///Users/evacchi/Devel/java/chicory/android-tests/device-tests/build/reports/androidTests/connected/debug/flavors/runtimeTests/index.html

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --debug option to get more log output.
> Run with --scan to get full insights.
> Get more help at https://help.gradle.org.

BUILD FAILED in 6m 42s
45 actionable tasks: 2 executed, 43 up-to-date
Watched directory hierarchies: [/Users/evacchi/Devel/java/chicory/android-tests]
(Screenshot of the test results, 2025-01-23, attached.)
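For context, the `notClass` exclusion used above goes through AGP's standard instrumentation-argument mechanism; a sketch of where it lives (the module path and surrounding block are assumed, not copied from the project):

```kotlin
// android-tests/device-tests/build.gradle.kts (path assumed)
android {
    defaultConfig {
        // AndroidJUnitRunner's "notClass" argument excludes a class from the run;
        // "class" (used earlier in this thread) selects a single one instead.
        testInstrumentationRunnerArguments["notClass"] =
            "com.dylibso.chicory.test.gen.SpecV1SkipStackGuardPageTest"
    }
}
```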

@andreaTP
Collaborator Author

With the updated commits, especially after android:largeHeap=true, I have the full runtime-tests running to completion successfully.

Now the next issue is that, as @evacchi noticed too, the Android SDK disconnects/crashes when trying to send back the test results (it's reproducible in CI).

@yigit any clue? Can you try to take a look?
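For reference, the `android:largeHeap=true` switch mentioned above is a one-line manifest attribute (the manifest below is a minimal sketch, not the project's actual file):

```xml
<!-- AndroidManifest.xml: android:largeHeap asks the runtime for an enlarged Dalvik
     heap; it is a hint, and the granted size still varies per device/emulator profile. -->
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
    <application android:largeHeap="true" />
</manifest>
```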

@andreaTP
Collaborator Author

@yigit I forgot to answer:

If we can make the maven task produce reproducible outputs (e.g. not snapshots and no timestamps in jars)

This is not going to happen unless we have a real-world use case for it.
It's extremely hard to achieve, and caching is not highly valuable given the minimal number of dependencies of this project.

But until then, nightly seems fine.

I'll integrate the changes when we finalize this PR.

There is always the option to use firebase test lab to further speed them up.

Let's get to a working and more complete setup, and we iterate as we see the need.

@yigit
Contributor

yigit commented Jan 24, 2025

This is the first time I've seen that issue, but it looks like it was already reported:

https://issuetracker.google.com/issues/330756856

For the short term, I think sharding is the easiest solution we have (assuming the JUnit 5 runner doesn't have an issue with it).

Hopefully this weekend I'll have some time to play with this.
Given that this test runs by itself but not with others, I'm curious if there might be a leak somewhere.

but probably the issue is scoped to that particular test (so far) because it intentionally grows the virtual stack of the interpreter, causing a lot of garbage to be retained.

Is there a limit on the virtual stack, or is it basically the limit of the host machine? If it is the limit of the host, yes, we'll likely hit it no matter what. If it is bounded by the host process, it might make sense to limit it to (max memory minus some buffer, like a MB). But I'm guessing the same thing would be a problem on the JVM. Maybe the overhead of the test runner is bigger, as it has to keep everything in memory until reporting (about which I'm surprised, but also not that surprised :) )

@yigit
Contributor

yigit commented Jan 24, 2025

I've created 2 variants with sharding.

  • One of them just shards,
  • The other one moves the Maven build to a separate step so others can re-use its output. This would reduce GitHub resource usage but would likely increase CI times.

We also don't have to use a matrix build and can simply invoke Gradle multiple times in the same build to run shard by shard. We might need some massaging to prevent results from overwriting each other. We may also not want to shard small flavors (e.g. run connectedRuntimeDebugAndroidTest and then connectedRuntimeTestsDebugAndroidTest with shards).
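Shard-by-shard invocation in a single CI job could look like the sketch below. `numShards` and `shardIndex` are AndroidJUnitRunner's real sharding arguments; the loop, the report path, and the renaming step are assumptions about this project's layout:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run the suite in 4 shards, one Gradle invocation each, renaming the report
# directory after every pass so shards don't overwrite each other's results.
NUM_SHARDS=4
for i in $(seq 0 $((NUM_SHARDS - 1))); do
  ./gradlew device-tests:connectedRuntimeTestsDebugAndroidTest \
    -Pandroid.testInstrumentationRunnerArguments.numShards="$NUM_SHARDS" \
    -Pandroid.testInstrumentationRunnerArguments.shardIndex="$i"
  mv device-tests/build/reports/androidTests \
     "device-tests/build/reports/androidTests-shard-$i"
done
```

Note that the runner's default sharding splits at the method level, which matters for the ordering discussion later in this thread.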

The other 2 things in those branches are:

  • zip the results: GH upload does not support : in file names, so we need to zip them before calling the GitHub upload action.
  • Gradle scans: I've enabled Gradle scans. Gradle doesn't really understand Android tests, but it will give some overall visibility into what is happening in the Gradle build. It's just good to have for reference and diagnosis in the future.

It is failing with

java.lang.NullPointerException: Attempt to invoke virtual method 'com.dylibso.chicory.runtime.Instance$Exports com.dylibso.chicory.runtime.Instance.exports()' on a null object reference
at com.dylibso.chicory.test.gen.SpecV1AddressTest.test6(SpecV1AddressTest.java:105)
at java.lang.reflect.Method.invoke(Native Method)
at java.util.ArrayList.forEach(ArrayList.java:1262)
at java.util.ArrayList.forEach(ArrayList.java:1262)

like errors, but I didn't get a chance to diagnose them.

@andreaTP
Collaborator Author

@yigit thanks a lot for getting back and all the information!

Is there a limit on the virtual stack or is it basically the limit of the host machine.

I did the research (specs/proposals, etc.) and haven't found any good reference; it looks like everyone relies on the underlying runtime limit.
The interesting story is that the emulator was throwing this error:

java.lang.OutOfMemoryError: OutOfMemoryError thrown while trying to throw an exception; no stack trace available

already reported here.
Basically everything is working as expected, but, given that the StackOverflowError contains a huge stack trace, the emulator was crashing with an OOM while trying to fill it in 😅

i think sharding is the easiest solution

I'm fine with it if it's reproducible locally (and we can bundle the command to execute everything locally in a bash script).

Trying out the commands in your branch: sharding tries to execute single test methods on different shards, and this is not going to work (that's the root cause of the NPE you noticed).
We discovered early on that WebAssembly wast tests (for example) are not self-contained; they depend on the ordered execution of all the methods in each class (that's why we heavily depend on the Ordered annotation).
Basically, we should ensure that sharding keeps classes, not methods, as the minimal split unit.
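To make the ordering constraint concrete, here is a self-contained sketch (made-up names, not the generated tests) of why a shard that picks up individual methods breaks: later methods read state that only earlier ones create, so they must run together and in `@Order` sequence. The sketch emulates JUnit's ordering with plain reflection:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical mini-suite showing why method-level sharding breaks wast-derived tests.
public class OrderedSuiteSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Order { int value(); }

    public static class WastSuite {
        Object instance; // stands in for the instantiated wasm module

        @Order(1) public void instantiate() { instance = new Object(); }

        @Order(2) public void invokeExport() {
            // NPEs if instantiate() did not run first - exactly the failure
            // seen when the runner shards by method.
            instance.toString();
        }
    }

    // Runs all @Order methods in annotation order; returns the names that ran.
    public static List<String> runAll() {
        List<Method> methods = new ArrayList<>();
        for (Method m : WastSuite.class.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Order.class)) methods.add(m);
        }
        methods.sort(Comparator.comparingInt(m -> m.getAnnotation(Order.class).value()));
        WastSuite suite = new WastSuite();
        List<String> ran = new ArrayList<>();
        for (Method m : methods) {
            try {
                m.invoke(suite);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
            ran.add(m.getName());
        }
        return ran;
    }

    public static void main(String[] args) {
        System.out.println(runAll()); // prints [instantiate, invokeExport]
    }
}
```

A shard containing only `invokeExport()` would throw the same kind of NPE seen in SpecV1AddressTest above.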

Regarding "zipping the results" and "Gradle scans": I truly appreciate your enthusiasm and your drive to make things better. Thank you!

That said, it’s important to keep the scope in mind. The Gradle build we use for running tests on Android serves as a basic sanity check. The simpler it is, the better.

More specifically:

  • Test results: These tests are focused solely on ensuring compatibility with Android. I don’t anticipate gaining deep insights from the results. In the case of a failure, the primary need is simply the ability to reproduce it locally.
  • Gradle scans: My goal is for us to have a setup that is stable and rarely needs updating or fixing. Once we achieve that, debugging the setup should ideally not be necessary.

Thank you again for your efforts; I really value your commitment and thoughtfulness!

@andreaTP
Collaborator Author

Forgot to follow up regarding the build matrix:
I think that we can keep a single execution testing only the stricter requirement, @yigit do you see any issue with this approach?

@yigit
Contributor

yigit commented Jan 24, 2025

I'm afraid the Android test runner does not support sharding by class 😞.
We could consider doing it in Gradle. Since we already have a jar, it wouldn't be hard to partition it by class into multiple jars (and multiple flavors). Obviously, that would mean more complex build logic in Gradle, but I'm not sure how easy that is in Maven. Maybe you can create a Maven task to do that?

I think that we can keep a single execution testing only the stricter requirement, @yigit do you see any issue with this approach?

Not sure if I understand. You mean invoking Gradle multiple times, once per shard? We can definitely extract this to a shell script or a single Gradle task (which might be nicer for local analysis, as Studio would discover it as well, so one could debug a specific failure). But I'm guessing a shell script will be easier for contributors.

Also, about the test results/scans:
I think test results are important, as they immediately tell you what failed in a nice HTML page. I can remove the XML reports, though (I think those are the ones creating the problem).

Scans are good if we ever want to optimize this, but they don't really matter (it also costs nothing to publish them for open-source projects).

But we can remove either of these (though I would recommend keeping test results).

@yigit
Contributor

yigit commented Jan 25, 2025

I played a bit more with this, sharding by class, etc.
But SpecV1MemoryCopyTest alone is big enough that it hits the gRPC limit by itself. The limit might be per class; I've not checked what is really going on there.

Maybe that can be avoided by changing codegen to divide tests but based on the order problem you've mentioned, I'm not even sure if that is feasible.

I think the only viable option might be to generate fat tests for Android, e.g. JavaTestGen could generate tests that look like this when invoked for Android:

class GeneratedTest {
    // @Test // remove test annotations from the test methods
    // @Order(1)
    // @DisplayName("memory_copy.wast:15")
    public void test1() {
        var varTest = testModule0Instance.exports().function("test");
        var exception = assertDoesNotThrow(() -> varTest.apply());
    }
    ...
    @Test // have one test that calls all the others in the right order
    @DisplayName("run all tests in wasmFileName")
    public void runAllTests() {
        // maybe call the instantiations here, or keep relying on order for them
        test1();
        test2();
        test3();
    }
}

Not sure if that is something you would be interested in doing, but I'm a bit out of options here :'(. Sorry, I didn't expect Android support to get so complicated :/.

@evacchi
Collaborator

evacchi commented Jan 25, 2025

no worries, all your suggestions are very welcome :D

keep a single execution testing

I think what @andreaTP meant is "a single command to run the entire suite", but let's hear from him

I think the only viable option might be to generate fat tests for android. e.g. JavaTestGen could generate tests that look like this, when invoked for android.

I don't know what you guys think of this approach, but if we know which tests cause the problem (and I believe there will be a handful), maybe we could just exclude them from the main run and then create a task that runs those separately?

e.g.

testAll() {
   runFirstBatch()
   runSecondBatch()
   // if necessary, more batches
}

we could also "tag" the tests with an annotation (e.g. @LargeTest), if the number of methods is above some threshold
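AndroidJUnitRunner can already filter on arbitrary runtime-retained annotations through its `annotation`/`notAnnotation` instrumentation arguments, so the tagging could be a plain custom annotation. A sketch (the `LargeSuite` name and the example class below are made up, not part of the project):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker the code generator could attach to classes whose method
// count exceeds some threshold; must be RUNTIME-retained so the test runner
// can see it via reflection when filtering.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface LargeSuite {}

// Example of a generated class the codegen might tag (name made up):
@LargeSuite
class SpecV1MemoryCopyTestExample {}
```

A run excluding the tagged classes would then pass something like `-Pandroid.testInstrumentationRunnerArguments.notAnnotation=<fully.qualified.LargeSuite>` (the package is a placeholder), and a second run with `annotation=` would pick up only the large suites.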

@andreaTP
Collaborator Author

Quickly answering from the phone.

You mean invoking Gradle multiple times per shard?

No, I mean removing the matrix for the Android SDK and testing only on API 33.

Re: test results, I'm happy to keep whatever comes for free; e.g. a step in CI is fine.
If it requires work or a complicated setup, we should try to avoid it.

Re: scans, no strong opinions; the Gradle build is a side build for this project, and I'm not sure how I would act on feedback from the scans.

is big enough that it hits the grpc limit by itself.

This is really worrying!
Some SIMD tests are generating more than 3 MB of source code... we are going to hit the limit anyhow.
Do you know where to open a relevant issue about this bug?
The gRPC limit should be tweakable with properties somewhere...

Maybe you can create a maven task to do that ?

The current test gen is already able to split the generation based on configuration, but I don't think we should invest in a complex setup there.

All in all, I think the priority here is to open a bug report and try to overcome the gRPC limit/client disconnection (this is the biggest blocker).

Second, we should start considering running a few smoke tests instead of the full TCK.
I'm thinking of something like machine-test running a real-world wasm workload, using the interpreter and the AOT at compile time, including WASI, etc.

@andreaTP
Collaborator Author

Sorry, i didn't expect android support to get so complicated :/.

@yigit I was honestly expecting some hiccups :-) and I'm grateful to have someone knowledgeable like you assisting in the process. Thanks a ton and no worries!

@yigit
Contributor

yigit commented Jan 25, 2025

I noticed that the bug somehow didn't get triaged and was reported in the wrong component. I moved it to the Android Gradle Plugin component and will ping someone during the week.

I'm afraid this is the code that creates that grpc client:

And it doesn't receive any configuration parameters.
I'll check with the Gradle team later in the week; meanwhile, I like the annotation solution from @evacchi. FYI, we don't need to use the Android annotations (e.g. LargeTest); we can make up an annotation and filter tests out while running them (or "shard" runs using annotations).

Are the machine-tests more end-to-end tests? If so, they are definitely worth prioritizing.

@andreaTP
Collaborator Author

Separating the tests into groups with annotations is definitely possible and easy to implement 👍. As soon as we verify that it can work, I volunteer to make the changes (e.g. @Group1, @Group2, etc.).

machine-tests run QuickJS (compiling and running a simple JS example) with WASI, mixing "Machines".
For Android we should be able to check the compile-time AOT and the interpreter.
The idea is: if a JS compiler and runtime work, we have some confidence that things will work on Android, and we can treat the rest as bugs, increasing coverage when needed.

@andreaTP
Collaborator Author

And it doesn't receive any configuration parameters.

Usually it is possible to use env vars and global properties anyhow; I'll check on Monday.

andreaTP added a commit that referenced this pull request Jan 30, 2025
Investigating #746 I noticed that the resizes of the `ArrayList` were
consuming a lot of memory.
@andreaTP
Collaborator Author

@yigit no pressure and no timelines, just curious if you managed to keep the ball rolling on this subject and have any kind of news 🙂.

I'd be happy to help, but this is all very new to me, and I'm unsure where to start, for example, to try to tweak the gRPC limit. Any link/information will be much appreciated!

@yigit
Contributor

yigit commented Feb 1, 2025

Hey, sorry, I forgot to update here, but I did re-route it internally. The team is looking into it, but having it fixed and then shipped in a version that we can use is going to take a long time :'(

And there doesn't seem to be a workaround.

I think our 2 options here are:

  • don't run big tests
  • create fat tests that run all of these one by one, without them being individual tests.

If (2) is not too complex a code-generator change, I think it is the better option, so we get coverage without waiting for another AGP release. I can give it a try if you are open to that solution.

@andreaTP
Collaborator Author

andreaTP commented Feb 1, 2025

Thanks a lot for getting back @yigit !

I can give it a try if you are open to that solution.

Given the past conversation, where you mentioned that even some single test classes exceed the limits, I'm not sure how it will work.
That said, if you have something in mind, please go ahead with a poc and we iterate on it, I have no specific concerns with the approach.

In my mind, after #748, you should be able to avoid generating the test jars and directly invoke the codegen from Gradle to materialize the test classes on the fly.
In this way you have more granular control over what gets executed.
Again, this is merely "one way" of doing things, and I'm happy to support any viable alternative.

If everything fails for running the testsuite, we can always fall back to smoke testing; in that case, the next step would be to include the manually written WASI tests, as they run "hello world" samples from some popular languages.

@yigit
Contributor

yigit commented Feb 1, 2025

Given the past conversation where you mentioned that even some single test classes are exceeding the limits I'm not sure how it will work. That said, if you have something in mind, please go ahead with a poc and we iterate on it, I have no specific concerns with the approach.
Oh, I think that was a misunderstanding. Just running one test class where each test case is a separate @Test doesn't work (it hits the limit). If we merge them into one @Test, it should work (I assume; I haven't tried).

In my mind, after #748 , you should be able to avoid generating the test jars, and directly invoke the codegen from Gradle to materialize test classes on the fly. In this way you have more granular control over what will execute. Again, this is merely "one way" of doing things, and I'm happy to support any viable alternative.
I'll take a look. So this is basically importing the test-gen library from Gradle (like a regular dependency for the build) and then invoking it directly with the proper source sets?

@andreaTP
Collaborator Author

andreaTP commented Feb 1, 2025

If we merge them into 1 test, it should work (i assume, haven't tried).

This should be doable, but it's a bit of work... I'd try to avoid it if possible.

Re Gradle, I was thinking about doing something like this:

fabric8io/kubernetes-client#4829 (comment)

3 participants