deleter: inconsistency after AGNOS11 update #33993
Comments
Got any routes or dongles? |
only some dongles, added to the description |
Got this error on this route: https://connect.comma.ai/fe18f736cb0d7813/0000040d--3a202cd3b9 It actually increased by 1% after I couldn't engage anymore either. Logs/data are uploading. I pretty much only run master-ci so this was surprising. |
Got one segment when it happened. Interestingly, when I connected to my cell phone hotspot to SSH in and run the deleter test script, the deleter worked the moment the connection was established, without me running any command, and the out of storage event cleared. When I get home I'll upload the route |
My routes don't appear in connect so can't really help there, but got it with dongle a954cbd0682c58bd |
Yup, mine's doing the same too. Hoping the new build of AGNOS in master will address it... My dongle ID is likely useless as I don't have a data connection anymore with my C3. |
The new AGNOS doesn't address this. Just looking into this now. So far seems like it's all devices with NVMe drives, so likely a race condition (or failure in) mounting it. Have you guys gotten the "failed to mount" alert at all? |
Thanks for confirming! |
I have not seen that error. |
I haven't seen that error either. |
Grabbed one more: a0e5673b89222047/0000003d--2ac5e373e1/8 or a0e5673b89222047/0000003c--0133a69a87/3, according to the user's description |
Not sure if this is related but my C3X wasn't uploading anything. After deleting realdata it started uploading again. |
Just started looking into this:
|
I missed it when I did the AGNOS flash in the sound issue test |
got some routes: 075b133b6181e058/0000003e--67e218c522/1 075b133b6181e058/0000003a--3043e775b6/25 |
Posting links to relevant code here. The code that disables at 7% free: openpilot/selfdrive/selfdrived/selfdrived.py, lines 189 to 190 in 75bf756
The code that should have started deleting stuff when things reach 10% free: https://github.com/commaai/openpilot/blob/master/system/loggerd/deleter.py |
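Roughly, the relationship between those two checks looks like this (a simplified sketch; constant names and helpers here are illustrative, the real values live in the files linked above):

```python
# Simplified sketch of the two thresholds discussed above; names and structure
# are illustrative, not copied from openpilot.

DELETER_MIN_PERCENT = 10          # deleter should start freeing space below 10% free
DELETER_MIN_BYTES = 5 * 1024**3   # ...or below roughly 5 GB free

DISABLE_PERCENT = 7               # selfdrived raises the "out of storage"
                                  # noEntry event below 7% free

def deleter_should_run(free_percent: float, free_bytes: int) -> bool:
  return free_percent < DELETER_MIN_PERCENT or free_bytes < DELETER_MIN_BYTES

def out_of_storage(free_percent: float) -> bool:
  # If the deleter is keeping up, free space should never get this low.
  return free_percent < DISABLE_PERCENT
```

If the deleter keeps up, free space should never fall from the 10% deletion threshold all the way down to the 7% disable threshold, which is why hitting the "out of storage" alert points at the deleter not running.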
I've made a branch off a somewhat pinned master-ci that makes deleter log heavily with rotation, plus extra logging to see if deleter dies to something unexpected: https://github.com/nelsonjchen/openpilot/tree/33993_master-ci (I'll rebase it onto a newer master-ci whenever I feel like it.) It'll be logging onto the SOM's storage, which is good since it'll be separate from the NVMe drive. Anyway, just mentioning my methodology in case anyone else wants to follow along. |
Looked into it a bit more, and it's not actually the NVMe. Seems to be related to internet connectivity like @AlexandreSato mentioned. Curious what you find @nelsonjchen. |
Ooh, OK. I've added a bit more logging around sus stuff with "cloud" in the name like cloudlog then. |
I can't claim I've driven much, but so far no hits on the issue. Maybe I need to try a full shutdown/power-up away from wifi/internet. |
@AlexandreSato , anybody who can reliably hit this? maybe they can run the hacky logging patch? |
I think the best way would be to use replay, recreating logs in a loop. Lately, I've been connecting to my phone's hotspot to work around the issue. |
Unable to reproduce it so far; I guess my habits just aren't compatible. Hey, those who can reproduce it consistently, please cherry-pick my commit and share some logs. 🤞 I don't care if you're on a clone, a fork, have uploads ignored, or whatever. Just help. |
Should be very easy to reproduce, happens to me 100% of the time if there is no network connection. |
Thanks @korhojoa . Let me know if you need any changes or modifications to the branch as well. |
I've started a thread in the Discord: https://discord.com/channels/469524606043160576/1322679694889324616 |
Once the available space reaches 10% or 5GB, it will delete 1 or 2 segments to free up space, but it won't delete more than that so that we keep as many logs as possible. It will only delete more once new segments are created (so while driving it will delete one every minute). |
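As a rough illustration of that loop (a simplified sketch with illustrative names and paths, not the actual system/loggerd/deleter.py):

```python
# Minimal sketch of the deleter loop described above; not the real
# implementation, just the control flow that matters for this issue.
import os
import shutil
import threading

MIN_PERCENT = 10
MIN_BYTES = 5 * 1024**3

def deleter_thread(exit_event: threading.Event, log_root: str) -> None:
  while not exit_event.is_set():
    usage = shutil.disk_usage(log_root)
    low_space = (100 * usage.free / usage.total) < MIN_PERCENT or usage.free < MIN_BYTES

    if low_space:
      # delete the oldest segment, never the one currently being written
      segments = sorted(os.listdir(log_root))[:-1]
      if segments:
        shutil.rmtree(os.path.join(log_root, segments[0]), ignore_errors=True)
      exit_event.wait(.1)   # re-check quickly; frees one segment per pass
    else:
      exit_event.wait(30)   # 30 s idle wait between checks
```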
log.zip |
It's happening due to a time warp, which can be seen in @korhojoa's logs. The clock is going back to 2015; the trigger is unknown, but that isn't supposed to matter. threading's Event.wait() is supposed to use a monotonic clock and therefore be robust to that, but it isn't. @incognitojam found python/cpython#91394, which may describe the issue. To reproduce on device, apply this patch:

```diff
diff --git a/system/loggerd/deleter.py b/system/loggerd/deleter.py
index 2a2bd5fe1..44e796a62 100755
--- a/system/loggerd/deleter.py
+++ b/system/loggerd/deleter.py
@@ -69,7 +69,8 @@ def deleter_thread(exit_event: threading.Event):
           cloudlog.exception(f"issue deleting {delete_path}")
       exit_event.wait(.1)
     else:
-      exit_event.wait(30)
+      cloudlog.info("Sleeping")
+      exit_event.wait(5)
```

Then launch deleter.
Then tamper with the system time in another terminal:
You will stop seeing "Sleeping" output from deleter, but it will resume when the clock is fixed:
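For anyone who wants to poke at this without a patched openpilot checkout, a standalone script along these lines should show the same behavior (a sketch of mine, not from openpilot; it assumes root, a GNU `date` that accepts `-s`, and NTP sync disabled so the jump actually sticks):

```python
#!/usr/bin/env python3
# Standalone sketch of the same reproduction, outside openpilot.
# On an affected Python build the "tick" output stops after the clock jump;
# on a healthy build it keeps going. Don't run this on a machine whose clock
# you care about.
import subprocess
import threading
import time

exit_event = threading.Event()

def ticker() -> None:
  while not exit_event.is_set():
    print("tick", round(time.monotonic(), 1), flush=True)
    exit_event.wait(2)  # the call under suspicion

threading.Thread(target=ticker, daemon=True).start()

time.sleep(6)
print("jumping the wall clock back to 2015...", flush=True)
subprocess.run(["date", "-s", "2015-01-01 00:00:00"], check=False)

time.sleep(20)  # watch whether ticks keep coming
exit_event.set()
```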
This explains why deleter recovers when we get Internet connectivity: that's when the system clock gets corrected again. The correct fix is not yet clear to me. There are a number of hacks we could apply, but ideally, if Python is supposed to be using a monotonic clock for Event.wait(), we should find out why it isn't and fix that. |
Following up on my own comment: the same problem cannot be reproduced on release; the threading wait() call doesn't seem vulnerable to system time changes there. It remains to be determined what changed in Python or AGNOS that broke monotonic timekeeping, but having identified that it's a regression, it should be fixable somehow. Correct behavior with:
Wrong behavior with:
|
Very interesting! Should be easy to remove this use. RE the NVMe assumption: it may have just mostly affected the comma three, triggered by #34217. |
If this is really a Python bug, not sure there's much we can do here besides block the usage of it (line 255 in 2258ea3)
|
Python clearly goes to some effort to use a monotonic clock wherever available, but there are also platforms where there isn't one, and it may depend on underlying things like glibc. It's kind of a rabbit hole: https://bugs.python.org/issue23428 It's definitely possible to fix our Python; it used to work in AGNOS 10, and it works on desktop. The question is whether to spend the time tracing down why only the AGNOS 11 build is busted, or to take the cheap way out. If we take the cheap way out, we need to make sure clock rollbacks are included in what runs on the comma test rack. TBH that should probably happen anyway. |
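For illustration, one form the "cheap way out" could take is a wrapper that never hands a long timeout to Event.wait() and instead polls against time.monotonic() (a sketch only, not openpilot's decided fix; it assumes time.monotonic() and time.sleep() stay well-behaved on the affected build, which would need to be verified):

```python
# Sketch of a defensive wait helper; not openpilot's chosen fix.
# Assumes time.monotonic() and time.sleep() are unaffected by wall-clock jumps
# on the target build (worth verifying before relying on this).
import threading
import time

def safe_wait(event: threading.Event, timeout: float, poll: float = 0.5) -> bool:
  deadline = time.monotonic() + timeout
  while not event.is_set():
    remaining = deadline - time.monotonic()
    if remaining <= 0:
      return False
    time.sleep(min(poll, remaining))
  return True
```

Deleter's idle path would then call safe_wait(exit_event, 30) instead of exit_event.wait(30), bounding any clock-jump stall to roughly one poll interval, at the cost of up to that much extra shutdown latency.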
AGNOS 11.5 #34413 bumps our Python to 3.12.8. Unfortunately, it's still not fixed there, and we can't go to 3.13 yet. It's possible it's just an issue with uv's Python builds (https://github.com/astral-sh/python-build-standalone). |
I suspect exactly that; I think it's an upstream build config oops, like we had with libzmq. |
Describe the bug
I suspect that the deleter is no longer consistent after the AGNOS 11 update. I haven't tested it yet to confirm, but I've already caught four people who had to delete realdata/* manually (because of a noEntry "out of storage" event)
Provide a route where the issue occurs
dongleIDs: 0b54d0594d924cd9, 075b133b6181e058, a0e5673b89222047
075b133b6181e058/000000cc--7d1fff3f36/42
openpilot version
1376096
Additional info
Not confirmed