
deleter: inconsistency after AGNOS11 update #33993

Open
AlexandreSato opened this issue Nov 12, 2024 · 36 comments
@AlexandreSato
Contributor

AlexandreSato commented Nov 12, 2024

Describe the bug

I suspect the deleter is no longer working consistently after the AGNOS 11 update. I haven't tested it yet to confirm, but I've already caught four people who had to delete realdata/* manually (because of a noEntry "out of storage" event).

Provide a route where the issue occurs

dongleIDs: 0b54d0594d924cd9, 075b133b6181e058, a0e5673b89222047

075b133b6181e058/000000cc--7d1fff3f36/42

openpilot version

1376096

Additional info

Not confirmed

[screenshot]

@adeebshihadeh adeebshihadeh added this to the 0.9.8 milestone Nov 12, 2024
@adeebshihadeh
Contributor

Got any routes or dongles?

@AlexandreSato
Contributor Author

Got any routes or dongles?

only some dongle IDs, added to the description

@nelsonjchen
Contributor

Got this error on this route:

https://connect.comma.ai/fe18f736cb0d7813/0000040d--3a202cd3b9

It actually increased by 1% after I couldn't engage anymore either. Logs/data are uploading.

I pretty much only run master-ci so this was surprising.

@AlexandreSato
Contributor Author

got one segment where it happened:
075b133b6181e058/000000cc--7d1fff3f36/42

Interestingly, when I connected to my cell phone's hotspot to SSH in and run the deleter test script, the deleter started working the moment the connection was established, without me running any command, and the out of storage event cleared. I'll upload the route when I get home.

@korhojoa

My routes don't appear in connect, so I can't really help there, but I got it with dongle a954cbd0682c58bd.

@michaelhonan
Contributor

Yup, mine's doing the same too. Hoping the new build of AGNOS in master will address it...

My dongle ID is likely useless as I don't have a data connection anymore with my C3.
dc30ef2ad294a732

@adeebshihadeh
Contributor

The new AGNOS doesn't address this. Just looking into this now.

So far it seems like it's all devices with NVMe drives, so likely a race condition in (or a failure of) mounting it. Have you guys gotten the "failed to mount" alert at all?

@adeebshihadeh adeebshihadeh changed the title deleter: inconsistency after AGNOS11 update deleter: inconsistency after AGNOS11 update (NVMe related) Nov 18, 2024
@michaelhonan
Contributor

Thanks for confirming!
I personally have not seen an error like that

@nelsonjchen
Contributor

I have not seen that error.

@AlexandreSato
Contributor Author

I haven't seen that error either.

@AlexandreSato
Contributor Author

AlexandreSato commented Nov 18, 2024

grabbed one more: a0e5673b89222047/0000003d--2ac5e373e1/8, or in a0e5673b89222047/0000003c--0133a69a87/3 according to the user's description

@adeebshihadeh adeebshihadeh mentioned this issue Nov 22, 2024
@pbassut
Contributor

pbassut commented Nov 23, 2024

Not sure if this is related, but my C3X wasn't uploading anything. After deleting realdata, it started uploading again.

@adeebshihadeh
Contributor

adeebshihadeh commented Nov 25, 2024

Just started looking into this:

075b133b6181e058/000000cc--7d1fff3f36/42

  • eventually recovers towards the end of the drive, almost like the deleter was stuck
  • no rlogs
  • bootlog for this route is missing, not on the NVMe?

fe18f736cb0d7813/0000040d--3a202cd3b9

  • doesn't recover like the other route did
  • thanks for uploading all the logs @nelsonjchen!
  • has the bootlog, and NVMe is mounted for that bootlog

@AlexandreSato
Contributor Author

I missed it when I did the AGNOS flash in the sound issue test

@AlexandreSato
Contributor Author

got some routes:

075b133b6181e058/0000003e--67e218c522/1
075b133b6181e058/0000003d--2e512616b9/3
I think 0000003f has something interesting too

075b133b6181e058/0000003a--3043e775b6/25

@nelsonjchen
Contributor

Posting links to relevant code here:

The code that disables at 7% free:

if self.sm['deviceState'].freeSpacePercent < 7 and not SIMULATION:
  self.events.add(EventName.outOfSpace)

The code that should have started deleting stuff when things reach 10% free:

https://github.com/commaai/openpilot/blob/master/system/loggerd/deleter.py
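Paraphrasing that trigger condition as a pure function makes it easier to reason about (a sketch: the 10% / 5 GB floors are the numbers discussed in this thread, not lifted verbatim from deleter.py):

```python
def should_delete(free_bytes: int, total_bytes: int,
                  min_percent: float = 10.0, min_bytes: int = 5 * 2**30) -> bool:
    """Start deleting the oldest segments once free space drops below
    EITHER floor. Threshold values are assumptions based on this thread."""
    free_percent = 100.0 * free_bytes / total_bytes
    return free_percent < min_percent or free_bytes < min_bytes
```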

@nelsonjchen
Contributor

nelsonjchen commented Dec 4, 2024

I've made a somewhat-pinned branch of master-ci that makes deleter log like hell (verbosely, with rotation, to somewhere in /data), plus extra logging to see if deleter dies to something that's not an OSError, and I intend to "daily it":

https://github.com/nelsonjchen/openpilot/tree/33993_master-ci

(I'll rebase it onto a newer master-ci whenever I feel like it.)

It'll be logging onto the SOM's storage, but that's good since it'll be separate from the NVMe drive.

That, combined with fallocate -l 400G /data/media/dummy.dat to make the storage "full-ish" so the deleter stays in the persistently-deleting state on my 1 TB C3 SSD, should help trace the issue when it breaks again. I had to put this weight back on since I had cleared some of the SSD to get it working again ASAP.

Anyway, just mentioning my methodology if anyone else wants to follow along.

@adeebshihadeh
Contributor

adeebshihadeh commented Dec 5, 2024

Looked into it a bit more, and it's not actually the NVMe. Seems to be related to internet connectivity like @AlexandreSato mentioned. Curious what you find @nelsonjchen.

@nelsonjchen
Contributor

Ooh, OK. I've added a bit more logging around sus stuff with "cloud" in the name like cloudlog then.

@nelsonjchen
Contributor

I can't claim I've driven all that much, but so far, no hits on the issue. Maybe I need to try a full shutdown/power-up away from wifi/internet.

@nelsonjchen
Contributor

@AlexandreSato , anybody who can reliably hit this? maybe they can run the hacky logging patch?

@AlexandreSato
Contributor Author

@AlexandreSato , anybody who can reliably hit this? maybe they can run the hacky logging patch?

I think the best way would be to use replay, recreating logs in a loop. Lately, I've been connecting to my phone's hotspot to work around the issue.

@nelsonjchen
Contributor

nelsonjchen commented Dec 15, 2024

Unable to reproduce it so far. I guess my habits just aren't compatible. Hey, those who can reproduce it consistently: please cherry-pick my commit and send over some logs. 🤞

I don't care if you're on a clone or a fork, or have uploads disabled, or whatever. Just help.

@korhojoa

Should be very easy to reproduce; it happens to me 100% of the time if there is no network connection.
I'll try the branch.

@nelsonjchen
Contributor

Thanks @korhojoa . Let me know if you need any changes or modifications to the branch as well.

@inquisitev

41f4ae6d3b79a9b1 is my dongle ID.

I see this issue consistently. I added a menu item that shows my remaining storage in the developer menu. I observed that the storage went to 38 and stayed there for the next couple of hours until I stopped again; then it stopped at 21, then at 10. All of the routes that I drove appeared to be uploaded, but they were not being deleted from the C3. I do not have comma prime and take lots of long drives.

When I play with the deleter script to make it start deleting some stuff, it seems to try to delete only one route/segment and then stops? There were LOTS of routes here.
[screenshot]

Since that screenshot, I pushed an update with a syntax error and for some reason could not SSH back in (some service not started? IP changed? I'm not sure), so I had to reset. I am confident that I will see this again. I'll monitor it closely and improve the logging as I go.

@inquisitev

I've started a thread in the Discord: https://discord.com/channels/469524606043160576/1322679694889324616

@incognitojam
Member

it seems to try to delete only one route/seg then stops? there were LOTS of routes here.

Once the available space drops below 10% or 5 GB, it will delete 1 or 2 segments to free up space, but it won't delete more than that, so that we keep as many logs as possible. It will only delete more once new segments are created (so while driving, it will delete one every minute).

@korhojoa

log.zip
Here's log.txt from /data/33993_log/.
It would be nice to have more information on how much it's filling up, so drivers wouldn't be blind until it's suddenly too full and stops a little while later.

@jyoung8607
Collaborator

It's happening due to a time warp, which can be seen in @korhojoa's logs. The clock is going back to 2015; the trigger is unknown, but it isn't supposed to matter. threading's Event.wait() is supposed to use a monotonic clock and therefore be robust to that, but it isn't. @incognitojam found python/cpython#91394, which may describe the issue.

To reproduce on device, apply this patch:

diff --git a/system/loggerd/deleter.py b/system/loggerd/deleter.py
index 2a2bd5fe1..44e796a62 100755
--- a/system/loggerd/deleter.py
+++ b/system/loggerd/deleter.py
@@ -69,7 +69,8 @@ def deleter_thread(exit_event: threading.Event):
           cloudlog.exception(f"issue deleting {delete_path}")
       exit_event.wait(.1)
     else:
-      exit_event.wait(30)
+      cloudlog.info("Sleeping")
+      exit_event.wait(5)

Then launch deleter in one terminal:

comma@comma-2a02b9c8:/data/openpilot$ LOGPRINT=debug system/loggerd/deleter.py  
Sleeping                               
Sleeping                               
Sleeping       

Then tamper with the system time in another terminal:

sudo systemctl stop systemd-timesyncd
sudo date --set "1 Jan 2015 12:00:00"
date

You will stop seeing "Sleeping" output from deleter, but it will resume when the clock is fixed:

sudo systemctl start systemd-timesyncd

This explains why deleter recovers when we get Internet connectivity: timesyncd can fix the time.

The correct fix is not yet clear to me. There are a number of hacks we could apply, but ideally if Python is supposed to be using a monotonic clock for Event.wait(), we should find out why it isn't and fix that.
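One of those possible hacks, sketched here for illustration only (this is not the fix that was adopted): avoid Event.wait() entirely and poll the event while measuring elapsed time with time.monotonic(), which is immune to wall-clock steps. This assumes time.sleep() itself is clock-jump safe, which CPython documents for Linux.

```python
import threading
import time

def monotonic_wait(exit_event: threading.Event, timeout: float, poll: float = 0.5) -> bool:
    """Wait up to `timeout` seconds for exit_event, like Event.wait(), but
    measure elapsed time with time.monotonic() so a wall-clock step (e.g.
    timesyncd stepping the date back) can't stall or truncate the wait.
    The trade-off is up to `poll` seconds of wake-up latency."""
    deadline = time.monotonic() + timeout
    while not exit_event.is_set():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(min(poll, remaining))
    return exit_event.is_set()
```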

@jyoung8607
Collaborator

Following up to myself:

The same problem cannot be reproduced on release; there, the threading wait() call doesn't seem vulnerable to system time changes. It remains to be determined what changed in Python or AGNOS to break monotonic timekeeping, but having identified that it's a regression, it should be fixable somehow.

Correct behavior with:

comma@comma-2a02b9c8:/data/openpilot$ python --version
Python 3.11.4
comma@comma-2a02b9c8:/data/openpilot$ cat /VERSION
10.1

Wrong behavior with:

comma@comma-2a02b9c8:/data/openpilot$ python --version
Python 3.12.3
comma@comma-2a02b9c8:/data/openpilot$ cat /VERSION 
11.4

@adeebshihadeh adeebshihadeh changed the title deleter: inconsistency after AGNOS11 update (NVMe related) deleter: inconsistency after AGNOS11 update Jan 12, 2025
@adeebshihadeh
Contributor

adeebshihadeh commented Jan 12, 2025

Very interesting! Should be easy to remove this use of threading.Event.wait, but it's probably good to fully root cause this one...

RE the NVMe assumption: it may have just mostly affected the comma three, triggered by #34217.

@adeebshihadeh
Contributor

If this is really a Python bug, not sure there's much we can do here besides block the usage of threading.Event.wait. See how we do this for pytest.main:

"pytest.main".msg = "pytest.main requires special handling that is easy to mess up!"
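An analogous banned-API entry might look like this (a sketch: the section name follows ruff's flake8-tidy-imports banned-api config, and whether the TID251 rule can actually flag a method call through an Event instance, rather than a fully qualified name at import time, is an open question):

```toml
[tool.ruff.lint.flake8-tidy-imports.banned-api]
# hypothetical entry; ruff matches qualified names, so calls through an
# instance variable may slip past this check
"threading.Event.wait".msg = "not robust to system clock jumps on some Python builds (see #33993)"
```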

@jyoung8607
Collaborator

Python clearly goes to some effort to use a monotonic clock wherever available, but there are also platforms where there isn't one, and it may depend on underlying things like glibc. It's kind of a rabbit hole.

https://bugs.python.org/issue23428
https://bugs.python.org/issue41710

It's definitely possible to fix our Python; it used to work in AGNOS 10, and it works on desktop. The question is whether to spend the time tracing down why only the AGNOS 11 build is busted, or whether we take the cheap way out.

If we take the cheap way out, we need to make sure clock rollbacks are included in what runs on the comma test rack. TBH that should probably happen anyway.

@adeebshihadeh
Contributor

AGNOS 11.5 #34413 bumps our Python to 3.12.8.

Unfortunately, it's still not fixed there, and we can't go to 3.13 yet.

It's possible it's just an issue with uv's Python builds (https://github.com/astral-sh/python-build-standalone).

@jyoung8607
Collaborator

jyoung8607 commented Jan 18, 2025

It's possible it's just an issue with uv's Python builds (https://github.com/astral-sh/python-build-standalone).

I suspect exactly that; I think it's an upstream build config oops like the one we had with libzmq.

9 participants