During its very first boot sequence, one of my SPARC T4-2 machines started logging hardware errors. This was a brand-new machine, never used before, which I unpacked myself from a sealed box.
[CPU 0:0:0] NOTICE: Initializing MCU 0
[CPU 1:0:0] NOTICE: Initializing MCU 0
[CPU 0:0:0] NOTICE: Initializing MCU 1
[CPU 1:0:0] NOTICE: Initializing MCU 1
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 1:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 0, SB Mapping 1 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 0 -- ERRCNT: 0x0 LNERR: 0x0
[CPU 0:0:0] NOTICE: SMI Channel 1, SB Mapping 1 -- ERRCNT: 0xf LNERR: 0x188
[CPU 0:0:0] ERROR: N0.MCU1: SMI link failed memory link test.
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/MCU1 failed to initialize
[CPU 0:0:0] NOTICE: Updating Config Information for Guest Manager
[CPU 0:0:0] NOTICE: Issuing Host warm Reset
[CPU 1:0:0] NOTICE: Reconfiguring System
[CPU 0:0:0] NOTICE: Reconfiguring System
[CPU 1:0:0] NOTICE: MCU0: Memory Capacity is 64GB
[CPU 0:0:0] NOTICE: /SYS/MB/CMP0/MCU1 is disabled
[CPU 0:0:0] NOTICE: MCU0: Memory Capacity is 64GB
[CPU 1:0:0] NOTICE: MCU1: Memory Capacity is 64GB
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T1: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T3: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T5: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/L3T7: unusable (MCU disabled). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T0: unusable (Global symmetry rules). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T2: unusable (Global symmetry rules). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T4: unusable (Global symmetry rules). Not configured
[CPU 0:0:0] ERROR: /SYS/MB/CMP1/L3T6: unusable (Global symmetry rules). Not configured
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/MCU0: Not configured due to partial cache mode
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE1: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE1: Not configured (Required L3T not available)
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE3: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE3: Not configured (Required L3T not available)
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE5: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE5: Not configured (Required L3T not available)
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/CORE7: Not configured (Required L3T not available)
[CPU 1:0:0] ERROR: /SYS/MB/CMP1/CORE7: Not configured (Required L3T not available)
[CPU 0:0:0] NOTICE: Usable strands: 00ff00ff00ff00ff
[CPU 1:0:0] NOTICE: Usable strands: 00ff00ff00ff00ff
[CPU 0:0:0] NOTICE: System memory capacity is 128GB
[CPU 0:0:0] NOTICE: Enabling caches
[CPU 1:0:0] NOTICE: Enabling caches
[CPU 0:0:0] NOTICE: L3 Banks Enabled: 55
[CPU 1:0:0] NOTICE: L3 Banks Enabled: aa
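The hex masks at the end of the log can be decoded to cross-check what the firmware actually disabled. Below is a minimal Python sketch, assuming that bit i of each mask corresponds to strand i (or L3 bank i) and that each T4 core has 8 strands. That bit numbering is my assumption, not something documented in the log, but it is consistent with the ERROR lines above: odd-numbered cores disabled on both chips, odd L3 banks disabled on CMP0 and even ones on CMP1. The function names are made up for illustration.

```python
def set_bits(mask_hex: str) -> list[int]:
    """Return the positions of the bits set in a hex mask, lowest bit first."""
    value = int(mask_hex, 16)
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def usable_cores(strand_mask_hex: str, strands_per_core: int = 8) -> list[int]:
    """Cores with at least one usable strand (assumes strand i maps to bit i)."""
    return sorted({bit // strands_per_core for bit in set_bits(strand_mask_hex)})


# Values copied from the log above.
print(usable_cores("00ff00ff00ff00ff"))  # [0, 2, 4, 6]
print(set_bits("55"))                    # [0, 2, 4, 6]  (CMP0 L3 banks)
print(set_bits("aa"))                    # [1, 3, 5, 7]  (CMP1 L3 banks)
```

Under these assumptions the machine came up with only half of its cores and half of its L3 banks on each chip, which matches the list of components reported as "Not configured".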
A similar issue was previously logged with Sun/Oracle; however, access to the root cause and its resolution is blocked for anyone who does not have a valid support contract. How nice of them.
I assumed the issue was related to one of the memory riser boards, so I carried out the following steps:
- Cleared the fault in the system ILOM.
- Powered off the system and unplugged the power cables. Unplugging is required because 3.3V standby power is normally still present while the system is powered off.
- Unplugged each memory riser, one by one, and reseated the individual memory modules.
The next time I powered on the system, the fault was gone, and it has not appeared since.
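To confirm that a later boot really is clean, the captured console output can be scanned for ERROR lines. This is only a rough sketch: it assumes the console output has been saved as text in the same "[CPU x:y:z] LEVEL: message" format shown earlier, and the function name and sample text are mine.

```python
import re

# Matches lines such as "[CPU 0:0:0] ERROR: /SYS/MB/CMP0/MCU1 failed to initialize".
LINE_RE = re.compile(r"^\[CPU (\d+:\d+:\d+)\] (NOTICE|ERROR): (.*)$")


def error_lines(log_text: str) -> list[tuple[str, str]]:
    """Return (cpu, message) for every ERROR line in captured console output."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "ERROR":
            found.append((match.group(1), match.group(3)))
    return found


# A short excerpt of the faulty boot, used here as sample input.
sample = """\
[CPU 0:0:0] NOTICE: Initializing MCU 1
[CPU 0:0:0] ERROR: N0.MCU1: SMI link failed memory link test.
[CPU 0:0:0] ERROR: /SYS/MB/CMP0/MCU1 failed to initialize
"""

for cpu, message in error_lines(sample):
    print(cpu, message)
```

An empty result for the output of a fresh boot means no component was reported as failed or unconfigured.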
I am not sure of the actual root cause; possibly it was a bad connection with one of the memory modules. If anyone knows the details, let me know in the comments.