Tuesday, October 15, 2024

SPARC T4-2: SMI link failed memory link test

During its very first boot sequence, one of my SPARC T4-2 machines started logging hardware errors. This was a brand new machine, never used before, which I unpacked myself from a sealed box.

[CPU 0:0:0] NOTICE: Initializing MCU 0
[CPU 1:0:0] NOTICE: Initializing MCU 0
[CPU 0:0:0] NOTICE: Initializing MCU 1
[CPU 1:0:0] NOTICE: Initializing MCU 1
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0xf LNERR: 0x188
[CPU 0:0:0] ERROR: N0.MCU1: SMI link failed memory link test.
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/MCU1 failed to initialize
[CPU 0:0:0] NOTICE: Updating Config Information for Guest Manager
[CPU 0:0:0] NOTICE: Issuing Host warm Reset
[CPU 1:0:0] NOTICE: Reconfiguring System
[CPU 0:0:0] NOTICE: Reconfiguring System
[CPU 1:0:0] NOTICE: MCU0: Memory Capacity is 64GB
[CPU 0:0:0] NOTICE: /SYS/MB/CMP0/MCU1 is disabled
[CPU 0:0:0] NOTICE: MCU0: Memory Capacity is 64GB
[CPU 1:0:0] NOTICE: MCU1: Memory Capacity is 64GB
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T1: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T3: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T5: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T7: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T0: unusable (Global symmetry rules). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T2: unusable (Global symmetry rules). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T4: unusable (Global symmetry rules). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T6: unusable (Global symmetry rules). Not configured
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/MCU0: Not configured due to partial cache mode
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE1: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE1: Not configured (Required L3T not available)
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE3: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE3: Not configured (Required L3T not available)
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE5: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE5: Not configured (Required L3T not available)
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE7: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE7: Not configured (Required L3T not available)
[CPU 0:0:0] NOTICE: Usable strands: 00ff00ff00ff00ff
[CPU 1:0:0] NOTICE: Usable strands: 00ff00ff00ff00ff
[CPU 0:0:0] NOTICE: System memory capacity is 128GB
[CPU 0:0:0] NOTICE: Enabling caches
[CPU 1:0:0] NOTICE: Enabling caches
[CPU 0:0:0] NOTICE: L3 Banks Enabled: 55
[CPU 1:0:0] NOTICE: L3 Banks Enabled: aa
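
To spot the bad link quickly in a log like this, a small script can pick out the SMI entries with non-zero error counters and decode the hex masks near the end. This is only a rough sketch: the function names are my own, and the way I read the masks (bit 0 is the lowest-numbered unit, 8 strands per core) is an assumption. It does appear consistent with the log, where CORE1/3/5/7 and L3T1/3/5/7 on CMP0 end up not configured.

```python
import re

# A few entries copied from the boot log; the regex does not rely on line breaks,
# so it also works on a log that has been flattened into one long line.
SAMPLE_LOG = """
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0xf LNERR: 0x188
[CPU 0:0:0] NOTICE: Usable strands: 00ff00ff00ff00ff
[CPU 0:0:0] NOTICE: L3 Banks Enabled: 55
[CPU 1:0:0] NOTICE: L3 Banks Enabled: aa
"""

SMI_RE = re.compile(
    r"\[CPU (\d+):\d+:\d+\] NOTICE: SMI Channel (\d+), SB Mapping (\d+)"
    r" -- ERRCNT: (0x[0-9a-fA-F]+) LNERR: (0x[0-9a-fA-F]+)"
)


def failing_smi_links(log_text):
    """Return (cpu, channel, sb_mapping, errcnt, lnerr) for non-zero counters."""
    failures = []
    for cpu, chan, sb, errcnt, lnerr in SMI_RE.findall(log_text):
        errcnt_val, lnerr_val = int(errcnt, 16), int(lnerr, 16)
        if errcnt_val or lnerr_val:
            failures.append((int(cpu), int(chan), int(sb), errcnt_val, lnerr_val))
    return failures


def set_bits(hex_mask):
    """Return the positions of set bits, with bit 0 as the least significant."""
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def enabled_cores(strand_mask, strands_per_core=8):
    """Assumption: each core owns 8 consecutive strand bits, starting at bit 0."""
    return sorted({bit // strands_per_core for bit in set_bits(strand_mask)})


if __name__ == "__main__":
    print(failing_smi_links(SAMPLE_LOG))      # entries with non-zero ERRCNT/LNERR
    print(enabled_cores("00ff00ff00ff00ff"))  # cores still usable per node
    print(set_bits("55"), set_bits("aa"))     # L3 banks enabled on CMP0 / CMP1
```

In the full log, the only entry with non-zero counters is CPU 0, SMI Channel 1, SB Mapping 1, which lines up with the N0.MCU1 error that follows it.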

A similar issue was previously logged with Sun/Oracle; however, access to the root cause and its resolution is blocked for anyone who does not have a valid support contract. How nice of them.

I assumed the issue was related to one of the memory riser boards, so I carried out the following steps:

  1. Cleared the fault in the system ILOM.
  2. Powered off the system and unplugged the power cables. Unplugging is required because the system normally keeps 3.3V standby power while it is plugged in.
  3. Unplugged each memory riser, one by one, and then reseated individual memory modules.

The next time I powered on the system, the fault was gone, and it has not appeared again.


I am not sure of the actual root cause; possibly it was a bad connection on one of the memory modules. If anyone knows the details, let me know in the comments.
