CM5 without wifi hangs on reboot #6647

Open
nbuchwitz opened this issue Feb 4, 2025 · 14 comments

Comments

@nbuchwitz
Contributor

nbuchwitz commented Feb 4, 2025

Describe the bug

We stumbled over an issue where all CM5 modules without wifi seem to hang when rebooted. After some waiting the reboot completes, whereas CM5 modules with wifi show no such delay (same base boards, same software). As the reboot even worked in some rare cases on a CM5 without wifi, I started to debug it further.

When reboot hangs:

Dec 01 13:28:51 RevPi systemd[1]: Shutting down.
Dec 01 13:28:51 RevPi systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:28:51 RevPi systemd[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:28:51 RevPi kernel: watchdog: watchdog0: watchdog did not stop!
Dec 01 13:28:51 RevPi systemd-shutdown[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:28:52 RevPi systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:28:52 RevPi systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 01 13:28:52 RevPi systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 01 13:28:52 RevPi systemd-journald[167]: Received SIGTERM from PID 1 (systemd-shutdow).
Dec 01 13:28:52 RevPi systemd-journald[167]: Journal stopped

When reboot works immediately:

Dec 01 13:29:57 RevPi136828 systemd[1]: Shutting down.
Dec 01 13:29:58 RevPi136828 systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:29:58 RevPi136828 systemd[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:29:58 RevPi136828 kernel: mmc1: Failed to initialize a non-removable card
Dec 01 13:29:58 RevPi136828 kernel: watchdog: watchdog0: watchdog did not stop!
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 01 13:29:58 RevPi136828 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 01 13:29:58 RevPi136828 systemd-journald[174]: Received SIGTERM from PID 1 (systemd-shutdow).
Dec 01 13:29:58 RevPi136828 systemd-journald[174]: Journal stopped

The distinguishing line seems to be this one (always present when the reboot works):

Dec 01 13:29:58 RevPi136828 kernel: mmc1: Failed to initialize a non-removable card
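
For anyone who wants to check their own logs for this marker, here is a minimal sketch. The function name `mmc1_probe_abandoned` is made up for illustration; it only greps stdin for the kernel message quoted above.

```shell
# Report (via exit status) whether a shutdown log read from stdin contains
# the kernel message that shows the mmc1 probe had already been abandoned.
# The function name is hypothetical; the message text is taken from the log above.
mmc1_probe_abandoned() {
    grep -q 'mmc1: Failed to initialize a non-removable card'
}
```

Assuming a persistent journal, something like `journalctl -b -1 | mmc1_probe_abandoned && echo "probe had given up before reboot"` would inspect the previous boot.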

So it looks like there might be an issue with the sdio / mmc1 interface, which is unused on the wifi-less variant of the CM5. To verify my suspicion I've created a simple overlay which disables sdio2 completely:

[...]
       fragment@13 {
               target = <&sdio2>;
               __overlay__ {
                       status = "disabled";
               };
       };

With this the reboot works reliably in all tests so far. Even though it kind of works with a custom overlay, it looks wrong. It is also not a reliable solution for production, as during first boot only the cm5io dt loaded by the firmware is present and a subsequent reboot will fail very often.
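
To confirm on a running system whether the node really ends up disabled, a small helper like the following could be used. This is a sketch: `dt_node_status` is a hypothetical name, and the lookup through `/proc/device-tree/__symbols__/sdio2` in the usage note assumes the loaded device tree carries symbols.

```shell
# Print the effective status of a device-tree node directory.
# A node without a "status" property counts as enabled ("okay").
# The function name is hypothetical.
dt_node_status() {
    node_dir=$1
    if [ -f "$node_dir/status" ]; then
        # Property values are NUL-terminated strings; strip the terminator.
        tr -d '\0' < "$node_dir/status"
        echo
    else
        echo okay
    fi
}
```

Usage, under the assumption above: `dt_node_status "/proc/device-tree$(tr -d '\0' < /proc/device-tree/__symbols__/sdio2)"` should print `disabled` once the overlay is applied.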

The same reboot works on CM4 with / without wifi (different overlay though, but that should be irrelevant as it also happens with the pure CM dt).

Any ideas / insights on this?

Steps to reproduce the behaviour

  1. Boot device with CM5 without wifi module
  2. sudo reboot

Device (s)

Raspberry Pi CM5

System

2024/09/23 14:02:56 
Copyright (c) 2012 Broadcom
version 26826259 (release) (embedded)

EEPROM release: 1727096576

Kernel: 6.6.74+rpt-rpi-v8

Logs

No response

Additional context

No response

@nbuchwitz
Contributor Author

nbuchwitz commented Feb 6, 2025

I did some further research and noticed that /sys/kernel/debug/mmc1/ios differs in good and bad cases:

pi@RevPi136828:~/debug$ diff --side-by-side working/mmc1_ios notworking/mmc1_ios 
clock:		0 Hz					      |	clock:		100000 Hz
vdd:		0 (invalid)				      |	actual clock:	100000 Hz
							      >	vdd:		21 (3.3 ~ 3.4 V)
bus mode:	2 (push-pull)					bus mode:	2 (push-pull)
chip select:	0 (don't care)					chip select:	0 (don't care)
power mode:	0 (off)					      |	power mode:	2 (on)
bus width:	0 (1 bits)					bus width:	0 (1 bits)
timing spec:	0 (legacy)					timing spec:	0 (legacy)
signal voltage:	0 (3.30 V)					signal voltage:	0 (3.30 V)
driver type:	0 (driver type B)				driver type:	0 (driver type B)

What could be the reason that power mode is set to on in the non-working (=hangs during reboot) case?
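
For scripting this comparison, the power mode can be pulled out of the ios file with a one-liner. A minimal sketch, assuming the file keeps the `power mode:<tab>N (name)` layout shown in the diff above; `ios_power_mode` is a made-up name.

```shell
# Print the symbolic power mode ("on", "off", ...) from an mmc ios debugfs file.
# Assumes a line of the form "power mode:<whitespace>N (name)".
# The function name is hypothetical.
ios_power_mode() {
    sed -n 's/^power mode:[[:space:]]*[0-9][0-9]* (\(.*\))[[:space:]]*$/\1/p' "$1"
}
```

Run as root (debugfs), e.g. `ios_power_mode /sys/kernel/debug/mmc1/ios`.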

It also seems that if the power mode is set to on it is reset to off after approx. 53 seconds (see attached debug log, first line is date, then uptime in seconds and then mmc1_ios content)

debug.txt
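
A log in that layout (date, uptime in seconds, ios content) can be collected with something like the sketch below. This is not the exact script used for the attachment; `sample_ios` is a hypothetical helper, and the optional second argument exists only so the uptime source can be swapped out.

```shell
# Print one sample: wall-clock date, uptime in seconds, then the ios content.
# $1: path to the ios file, $2: uptime source (defaults to /proc/uptime).
# The function name is hypothetical.
sample_ios() {
    ios_file=$1
    uptime_file=${2:-/proc/uptime}
    date
    # /proc/uptime holds "<uptime> <idle>"; keep only the first field.
    cut -d' ' -f1 "$uptime_file"
    cat "$ios_file"
}
```

Called in a loop as root, for example `while sleep 1; do sample_ios /sys/kernel/debug/mmc1/ios; done >> debug.txt`, it shows the moment the power mode flips to off.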

After I performed a firmware update to 1737505011, the time after which the power mode is switched to off increased to ~83 seconds (~ +30 seconds; with 1737983339 it is about 10 seconds less).

debug-fw1737505011.txt

A downgrade to 1731427844 showed the same behavior as with 1727096576 (initial firmware on this compute module): power_mode is set to off after approx 53 seconds:

debug-fw1731427844.txt

Handover to the OS takes about 8-9 seconds, so I don't think the difference is caused by that.

So it seems to me that this might be a firmware related issue or at least it has some influence.
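
As a side note, the firmware versions above are Unix timestamps, so they can be turned into readable build dates. A minimal sketch, assuming GNU `date` (the `-d @N` form); `eeprom_release_date` is a made-up name.

```shell
# Convert an EEPROM release number (a Unix timestamp) to a UTC date string.
# Requires GNU date for the "-d @<epoch>" syntax; the function name is hypothetical.
eeprom_release_date() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}
```

For example, `eeprom_release_date 1737505011` gives `2025-01-22 00:16:51 UTC`, which matches what `rpi-eeprom-update` reports for that release.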

I also did some testing on a CM4 without wifi, and there /sys/kernel/debug/mmc1/ios shows that the interface is disabled correctly upon boot.

@pelwell
Contributor

pelwell commented Feb 10, 2025

Hi Nicolai, we'll look into disabling SDIO2 from the firmware for non-WiFi-enabled parts.

@nbuchwitz
Contributor Author

Thanks Phil for the update

@pelwell
Contributor

pelwell commented Feb 10, 2025

pieeprom_cm5nowifi.zip
Here's a trial build with a theoretical fix - it should disable sdio2 on a CM5 with no WiFi. I've tried it on a Pi 5 to confirm that it isn't completely broken, but I don't have a suitable CM5 to hand - the next task is to locate one.

@nbuchwitz
Contributor Author

Give me some minutes and I will test it, I have modules at hand ...

@nbuchwitz
Contributor Author

nbuchwitz commented Feb 10, 2025

I can confirm, mmc1 is gone with the test firmware:

pi@RevPi136828:~$ ls -d /sys/kernel/debug/mmc?
/sys/kernel/debug/mmc0
pi@RevPi136828:~$ rpi-eeprom-update 
BOOTLOADER: up to date
   CURRENT: Mon Feb 10 12:04:08 PM UTC 2025 (1739189048)
    LATEST: Wed Jan 22 12:16:51 AM UTC 2025 (1737505011)
   RELEASE: default (/usr/lib/firmware/raspberrypi/bootloader-2712/default)
            Use raspi-config to change the release.

Reboot is also working without hang / delay.

@pelwell
Contributor

pelwell commented Feb 10, 2025

Great. We'll get that merged, then into a release at some point.

@nbuchwitz
Contributor Author

Thanks. In the meantime I will do some thinking and come up with some tooling for our end of line tests, so we can update the modules in place.

@nbuchwitz
Contributor Author

Just a note for others who might need to work around the issue that the first reboot after the firmware update still hangs (which is expected, as we're still running the old firmware at that point):

# set power to permanently on in order to avoid timeout of probe cycles
echo on | sudo tee /sys/class/mmc_host/mmc1/device/power/control

# unbind driver on mmc1
basename $(realpath /sys/class/mmc_host/mmc1/../..) | sudo tee /sys/bus/platform/drivers/sdhci-brcmstb/unbind
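
If this goes into tooling that runs on both old and new firmware, it may be worth guarding the lookup, since mmc1 no longer exists once the fixed firmware is active. A sketch with a hypothetical helper name (`mmc_platform_device`), assuming a `realpath` that resolves symlinks as it walks the path (the GNU default):

```shell
# Print the platform device name behind an mmc host directory, i.e. the
# value that would be written to the driver's "unbind" file.
# Returns 1 if the host does not exist. The function name is hypothetical.
mmc_platform_device() {
    host_dir=$1
    [ -e "$host_dir" ] || return 1
    # The class entry is a symlink into /sys/devices; two levels up from its
    # target is the platform device.
    basename "$(realpath "$host_dir/../..")"
}

# Usage sketch (not executed here):
#   if dev=$(mmc_platform_device /sys/class/mmc_host/mmc1); then
#       echo "$dev" | sudo tee /sys/bus/platform/drivers/sdhci-brcmstb/unbind
#   fi
```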

@pelwell
Contributor

pelwell commented Feb 11, 2025

It's odd that a non-WiFi CM5 is rebooting without issue for me. I've tried rebooting before the mmc1: Failed to initialize a non-removable card error message (which I don't always see), and I've tried afterwards. This is with the stock firmware 2024/09/23, and with the latest release (Wed 22 Jan 00:16:51 UTC 2025 (1737505011)). The worst I see is a stall of up to 40 seconds until the mmc driver gives up (mmc1: Failed to initialize a non-removable card).

The power mode difference is just an indicator of whether or not the kernel has given up on there being something on that SDIO bus - it turns off the power when it loses hope.

@nbuchwitz
Contributor Author

Yes, at some point the device does reboot (after the driver gives up on mmc1). The issue (at least for us) is that this causes timeouts during the end of line test, as the system expects the DUT to reboot within a reasonable period. On CM5 this extra delay on reboot is (depending on how fast the provisioning of the HAT eeprom was) up to 60 seconds, which will cause a timeout. It is also noteworthy that on non-wifi variants of the CM4 this works without additional delay.
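
To illustrate the kind of check that trips here: a test harness typically polls the DUT with a deadline, roughly like the sketch below. This is not our actual end of line tooling; `wait_until` and the host name in the usage note are made up.

```shell
# Poll a command once per second until it succeeds or the deadline passes.
# Returns 0 on success, 1 on timeout. Only the sleeps are counted, so the
# run time of the polled command itself is not included in the budget.
# The function name is hypothetical.
wait_until() {
    timeout_s=$1
    shift
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout_s" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

With a 30 second budget, for example `wait_until 30 ping -c 1 -W 1 dut.example`, an extra delay of up to 60 seconds before the reboot even starts is enough to fail the step.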

@pelwell
Contributor

pelwell commented Feb 11, 2025

The patch to disable sdio2 has been merged, so future EEPROM builds will include it. I do wonder though if the kernel retry mechanism can be adjusted to not take quite so long.

@nbuchwitz
Contributor Author

I do wonder though if the kernel retry mechanism can be adjusted to not take quite so long.

That was also what I was initially thinking when I raised this issue. I haven't had the time to dig deeper into what the differences between bcm2711 and bcm2712 are here, but from a first look they at least share the same driver for mmc1.

timg236 added a commit to timg236/rpi-eeprom that referenced this issue Feb 11, 2025
* recovery: Walk partitions to delete recovery.bin
  Previously, recovery.bin would fail to delete itself
  if the bootrom loaded recovery.bin where there are multiple FAT
  partitions and the first partition does not contain recovery.bin
  Update the rename code to walk the partition table to find
  the recovery.bin file to delete.
* pi5: Add config filter for simple boot variable expressions (experimental)
  Add support for a new bootloader/config.txt conditional filter
  which tests the partition, boot_count and boot_arg1 variables.
  Syntax (no spaces):
  ARG boot_arg1, boot_count or partition (EEPROM config stage only)
  [ARG=VALUE]      selected if (ARG == VALUE)
  [ARG&MASK]       selected if ((ARG & MASK) != 0)
  [ARG&MASK=VALUE] selected if ((ARG & MASK) == VALUE)
  [ARG<VALUE]      selected if (ARG < VALUE)
  [ARG>VALUE]      selected if (ARG > VALUE)
  where VALUE and MASK are unsigned integer constants and ARG
  corresponds to the value in the reset register before the
  config file is parsed.
* pi5: Add a boot-count bootloader variable (experimental)
  Store the boot-count in a reset register and increment just
  before the boot-order state-machine. The boot-count variable
  is visible via device-tree /proc/device-tree/chosen/bootloader/count
  and can be read/set via vcmailbox
  GET: sudo vcmailbox 0x0003008d 4 4 0
  SET to N: sudo vcmailbox 0x0003808d 4 4 N
* pi5: Add user-defined reboot argument (boot_arg1) (experimental)
  Add support for a user-defined boot parameter stored in a reset-safe
  scratch register on BCM2712.  This is visible via device-tree at
  /proc/device-tree/chosen/bootloader/arg1 and via vcmailboxes
  GET arg1: sudo vcmailbox 0x0003008c 8 8 1 0
  SET arg1 to 42: sudo vcmailbox 0x0003808c 8 8 1 42
  or via config.txt
  set_reboot_arg1=42
  The variable is NOT cleared automatically and will persist until
  a power-on-reset.
* Enable overriding of high partition numbers
  Previously, the PARTITION=N bootloader config setting would only
  be used at power on reset or if the partition number passed to
  reboot was zero.
  Change the behaviour so that the bootloader config PARTITION
  property can override the reboot partition number if the reboot
  parameter is > 31.
* Disable WiFi PMIC output on CM5 modules without WiFi
  Disable the 3.7V WiFi power supply on CM5 modules which do not have a
  WiFi module fitted. This fixes some stability issues where a CM5
  would shutdown due to a spurious over-voltage condition on the
  non-connected WiFi power supply.
* Add memory barrier to the mbox handler
  Firmware issue 1944 reports receiving kernel warnings about firmware
  requests where the status return code is 0. This should not be
  possible, as handle_mbox_property always sets the top bit of the return
  code, with the bottom bit indicating success or failure. If the firmware
  had died, the firmware driver would report a timeout due to the lack of
  a mailbox interrupt, and that isn't happening.
  See: raspberrypi/firmware#1944
* support dts files with size-cells of 2
  DTS files with a top-level #size-cells of 2 make a lot of sense for
  systems with a lot of RAM, but the firmware is currently inconsistent
  in its support for that. Fix up the other cases to honor #size-cells
  and #address-cells.
* Disable SDIO2 for CM5s without WiFi
  It has been observed that CM5s without WiFi hang on reboot. To prevent
  that, disable the sdio2 node on those devices.
  See: raspberrypi/linux#6647
* arm_dt: Use dtoverlay_enable_node
  Convert the open-coded DT node status changes to use the new dtoverlay
  method dtoverlay_enable_node.
* dtoverlay: Add dtoverlay_enable_node
  Add a helper function for setting the status of a node.
timg236 added a commit to raspberrypi/rpi-eeprom that referenced this issue Feb 11, 2025
@pelwell
Contributor

pelwell commented Feb 11, 2025

The rescan code tries 3 different card types at 4 different clock frequencies. All of those tests involve timeouts of specific durations, so they shouldn't simply be shortened. The other approach would be to make the scanning interruptible at some granularity - at least between frequencies. There may be a way to mark that the interface is being shut down - perhaps using the rescan_disable flag - but it's not something I'd want to do hastily.
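
To make the "interruptible between frequencies" idea concrete, here is a toy model in shell. It is not kernel code: the function name and the flag file are invented, the frequency list is an assumption based on a reading of the mmc core, and real probes would block on timeouts where this just prints a line.

```shell
# Toy model of a probe loop that can be interrupted between frequencies.
# Prints one line per attempted (frequency, card type) probe and returns 1
# if a stop flag (here: the existence of a file) is seen at a checkpoint.
# Frequencies are an assumption; the function name is hypothetical.
rescan_model() {
    stop_flag=$1
    for freq in 400000 300000 200000 100000; do
        # Checkpoint: give up early if a shutdown has been flagged.
        [ -e "$stop_flag" ] && return 1
        for card_type in sdio sd mmc; do
            echo "probe $card_type at $freq Hz"
        done
    done
}
```

Without the flag the model runs all 3 x 4 = 12 probes; with the flag set, the worst-case wait shrinks to the probes of a single frequency.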
