18 July 2024

Debugging Windows 11 crashes

Once or twice a week, I have been finding my system asleep when it should have been folding. Looking at "Event Viewer" after each occurrence, I noticed a pattern: a Kernel-Power event preceded (but not immediately) by a volmgr one.

To view the events, open "Event Viewer", expand "Windows Logs", and click on "System".

Simplified Example:

Level | Date and Time | Source | Event ID | Task Category
Critical | 5/7/2024 2:13:36 AM | Kernel-Power | 41 | (63)
    The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Error | 5/7/2024 2:13:36 AM | volmgr | 162 | None
    Dump file generation succeded.
Warning | 5/7/2024 2:11:00 AM | Display | 4101 | None
    Display driver amduw23g stopped responding and has successfully recovered.
Warning | 5/7/2024 2:06:07 AM | Display | 4101 | None
    Display driver amduw23g stopped responding and has successfully recovered.
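
To make the pattern in the example easier to spot, here is a rough Python sketch that pairs each Kernel-Power 41 entry with the Display 4101 warnings logged shortly before it. The rows are typed in by hand from the simplified example; the 15 minute window, the month/day/year date order, and the function names are my own assumptions for illustration, not something Event Viewer provides.

```python
from datetime import datetime, timedelta

# Rows copied by hand from the simplified example: (level, timestamp, source, event_id)
EVENTS = [
    ("Critical", "5/7/2024 2:13:36 AM", "Kernel-Power", 41),
    ("Error", "5/7/2024 2:13:36 AM", "volmgr", 162),
    ("Warning", "5/7/2024 2:11:00 AM", "Display", 4101),
    ("Warning", "5/7/2024 2:06:07 AM", "Display", 4101),
]


def parse_time(text):
    # Assumption: timestamps use month/day/year order with a 12-hour clock.
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


def warnings_before_crashes(events, window=timedelta(minutes=15)):
    """Map each Kernel-Power 41 time to the Display 4101 warnings within `window` before it."""
    parsed = [(parse_time(ts), source, eid) for _level, ts, source, eid in events]
    result = {}
    for when, source, eid in parsed:
        if source == "Kernel-Power" and eid == 41:
            result[when] = sorted(
                t
                for t, s, e in parsed
                if s == "Display" and e == 4101 and timedelta(0) <= when - t <= window
            )
    return result


for crash, warnings in warnings_before_crashes(EVENTS).items():
    print(crash, "->", len(warnings), "display driver timeouts in the preceding window")
```

If most crashes show display driver timeouts in the minutes before them, that points toward the graphics driver rather than, say, the power supply.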


Updating Drivers

As a first step to try and remedy this, I always update drivers. I updated the following:

  • chipset: 
    • 4.07.13.2243 -> 5.11.02.217 (from Asus)
    • 6.02.07.2300 (from AMD after above didn't fix it)
  • video: a version installed in 2024 -> 24.5.1


Updating the BIOS

I found some posts online that said it could be due to the motherboard, so I decided to try updating the BIOS to see if that would help. Unfortunately, it did not.

Before updating the BIOS/UEFI, make sure to write down your settings, as they will be reset.

BIOS/UEFI Settings

  • Expo -> Enabled
  • Fan Curves
    • CPU
      • 20C -> 20%
      • 60C -> 40%
      • 85C -> 70%
      • 90C -> 100%
    • System
      • 20C -> 30%
      • 50C -> 50%
      • 85C -> 80%
      • 90C -> 100%
  • Advanced Mode (F7)
    • Tool -> ASUS Armoury Crate
      •  Download & Install -> Disabled
    • AI Tweaker -> Precision Boost Override
      • Curve Optimizer
        • All Cores
        • Negative
        • 30
      • Precision Boost Override -> AMD Eco Mode
      • AMD Eco Mode -> cTDP 65W
    • AI Tweaker -> SOC Offset -> negative -> 0.03 (however, this caused the PC to hang on warm boots, so I use the settings below instead)
    • AI Tweaker -> CPU SOC Voltage -> Manual
      • VDD SOC Voltage Override -> 1.2
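
To sanity-check the fan curves listed under BIOS/UEFI Settings, the points can be turned into a small lookup. This is only a sketch: it assumes the firmware interpolates linearly between points and holds the first/last value outside the range, which the BIOS may or may not do. The names `CPU_CURVE`, `SYSTEM_CURVE`, and `fan_duty` are made up for illustration.

```python
# (temperature in C, fan duty in %) points taken from the settings list
CPU_CURVE = [(20, 20), (60, 40), (85, 70), (90, 100)]
SYSTEM_CURVE = [(20, 30), (50, 50), (85, 80), (90, 100)]


def fan_duty(curve, temp_c):
    """Estimate the fan duty (%) at temp_c, assuming linear interpolation between points."""
    # Below the first point or above the last one, hold the end value (assumption).
    if temp_c <= curve[0][0]:
        return curve[0][1]
    if temp_c >= curve[-1][0]:
        return curve[-1][1]
    # Find the segment containing temp_c and interpolate within it.
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


for temp in (40, 70, 87.5):
    print(f"{temp}C -> CPU {fan_duty(CPU_CURVE, temp):.0f}%, System {fan_duty(SYSTEM_CURVE, temp):.0f}%")
```

The steep last segment (85C to 90C) is intentional: the fans stay fairly quiet until the temperature gets close to the limit.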

Update the BIOS

After updating the BIOS, you will need to reapply the settings.


Old video drivers

Since none of the updates worked, I tried rolling back to older video drivers that I knew worked, going from 24.5.1 to 23.10.2. Luckily, this seems to have done the trick and my system is stable again with >27 days of uptime.


Slow folding

Unfortunately, it looks like my solution may be short-lived: to get the expected folding performance, I need to update to 24.6.1 (see https://foldingforum.org/viewtopic.php?t=41637&sid=d9b1c6f33f52801aaca8c31fc3fe52d1). Fingers crossed that this release is also stable.

Update 2024-07-29: I had a freeze after roughly 11 days of uptime

Update 2024-07-31: My PC crashed overnight after only 1.5 days of uptime


24.7.1

Update 2024-07-31: I updated the graphics drivers to 24.7.1. Here is the process that I used:

  • Downloaded the AMD driver cleanup utility (from here)
  • Downloaded the full updated AMD graphics driver (for 5600 XT)
  • Disabled Ethernet/Wi-Fi
    • I do this so that Windows doesn't try to "helpfully" download and install other graphic driver versions
  • Ran the AMD driver cleanup utility
    • Had it reboot me into Safe-Mode
    • It removed the drivers and prompted me to reboot, which I did
  • Installed the new drivers (had to approve the big red scary box because of no internet)
  • Rebooted
  • Enabled Ethernet/Wi-Fi


Update 2024-08-02: My computer has crashed twice in the past 2 days so I decided to try DDU

  • Downloaded Display Driver Uninstaller (aka DDU)
    • I downloaded from Guru3D
  • Disabled Ethernet/Wi-Fi
  • Rebooted into Safe Mode
  • Extracted the EXE from the ZIP file
  • Ran the EXE, which extracts even more files
  • Ran "Display Driver Uninstaller.exe"
  • Unchecked the final option (Disable Windows Update Driver Downloads)
    • Since we already disabled internet access, this is unnecessary, and DDU recommends re-enabling it afterwards anyway, so this saves a step
  • Closed the options
  • Device Type -> GPU
  • Device -> AMD
  • Clicked "Clean and restart"
  • Waited for it to do its thing and reboot
  • Installed the 24.7.1 drivers again
  • Rebooted
  • Enabled Ethernet/Wi-Fi

 


15 July 2024

Intel SSD upgrade for r730xd

I got two used Intel SSDs to improve VM performance and reduce power usage.


Check SMART attributes


Firmware Update Adventure

Intel Memory and Storage Tool CLI

My first attempt was to use the Intel CLI to update the firmware, but it did not work.


Solidigm

My second attempt was to use the Solidigm CLI to update the firmware, but it did not work through the RAID controller; the drives had to be attached to a different SATA controller.

  • Installation
  • Usage
  • Firmware Update
    • sst load -ssd <index>
    • Status : Firmware update failed.
      • If your drives are behind a RAID card, you may have to remove them for updating.
  • Firmware Update using a USB to SATA enclosure
    • sst show -ssd
    • Unfortunately, the drive no longer appeared :-(
  • Firmware Update using Windows with USB to SATA enclosure
    • The drive appeared and would run tests
    • However, attempting to update the firmware resulted in an error and the USB port being disabled until the PC was rebooted
  • Firmware Update using a JMB585 PCIe to SATA card, a SATA to eSATA bracket, and an external eSATA enclosure
    • Finally was able to update the firmware
    • Notes:
      • the firmware did not jump to the latest version and had to be updated incrementally
        • G2010140 -> G2010160 -> G2010170
      • the firmware update requested a reboot after installing so I did that
    • sst show -ssd
    • sst show -a -ssd <index> | grep Firmware
    • sst load -ssd <index>
    • Status : Firmware updated successfully. Please reboot the system.


Replacing the existing 1.2 TB drives

  • Set up the partitions
    • I used my rpool mirror instructions from here but made the following changes
      • removed the last-lba line
      • removed the start from all lines
      • removed the size from the final partition
      • don't run the zpool attach
  • Copy the rpool data
    • I am following the wizardry provided here
    • Add a mirror of the new disks to form a striped mirror
      • zpool add rpool mirror /dev/disk/by-id/ata-INTEL_SSDSC2BX800G4_<SERIAL1>-part3 /dev/disk/by-id/ata-INTEL_SSDSC2BX800G4_<SERIAL2>-part3
    • Check the mirror names and size
      • zpool status rpool
      • zpool list rpool
    • Remove the old mirror
      • zpool remove rpool mirror-0
    • Check for when "Evacuation of mirror" is done
      • zpool status rpool
      • remove: Removal of vdev 0 copied 10.7G in 0h1m
    • Check new size
      • zpool list rpool
  • Shut down, remove the old drives, and reboot to make sure everything is working as intended
  • Remove the old drives from proxmox-boot-tool
    • proxmox-boot-tool status
    • proxmox-boot-tool clean
    • proxmox-boot-tool status
  • Remove all data from old drives
    • put the drives back in
    • wipe them using your favorite tool (e.g. shred or dd)
      • dd if=/dev/zero of=/dev/sdX bs=1M status=progress
      • dd if=/dev/urandom of=/dev/sdX bs=1M status=progress
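
Instead of re-running zpool status by hand for the "Check for when "Evacuation of mirror" is done" step above, the wait could be scripted. Here is a minimal Python sketch, assuming the finished state shows up as a `remove:` line containing "Removal of vdev ... copied ..." like the one quoted in that step. The function names and the 30 second poll interval are my own choices, not part of ZFS.

```python
import subprocess
import time


def removal_finished(status_text):
    """Return True once the `remove:` line of `zpool status` reports a completed removal."""
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("remove:"):
            # Assumption: a finished removal reads "Removal of vdev N copied ...".
            return "Removal of vdev" in line and "copied" in line
    # No `remove:` line at all means no removal has been started on this pool.
    return False


def wait_for_removal(pool="rpool", interval_s=30):
    """Poll `zpool status <pool>` until the removal is reported as finished."""
    # Note: this loops forever if no removal was ever started on the pool.
    while True:
        result = subprocess.run(
            ["zpool", "status", pool], capture_output=True, text=True, check=True
        )
        if removal_finished(result.stdout):
            return
        time.sleep(interval_s)
```

Calling `wait_for_removal()` right after `zpool remove rpool mirror-0` would block until it is safe to move on to checking the new size.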


Conclusion

I saw a 7-8W drop in idle power usage and an increase in responsiveness.

