Hi
I am trying to detect correctable memory errors in the DIMM modules of my servers, which run ESXi 6.5.
I ran the following esxcli command to detect the errors:
-----------------------
esxcli hardware ipmi sel list | grep -B5 -A 3 -i -E "memory|correctable"
Record:390
Record Id: 390
When: 2019-02-28T01:08:16
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + Memory Correctable ECC
Sensor Number: 83
Raw:
Formatted-Raw:
--
Record:393
Record Id: 393
When: 2019-04-25T06:29:14
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + Memory Correctable ECC
Sensor Number: 83
Raw:
Formatted-Raw:
-------------------------
It shows two events reported by sensor number 83. How can I use this information to find out which memory module (the actual slot) they occurred in?
Basically, how can I map the sensor number from the command output above to a DIMM slot, e.g. DIMMA1?
Thank you
Dee
Hello.
A standard server has a hardware management interface generically known as IPMI. Depending on the vendor it goes by names such as IMM, BMC, XClarity, iLO and more.
The IPMI interface has a dedicated, labeled port. By default it is configured to obtain an IP address from a DHCP service; it can also be given a fixed IP from the server's UEFI (BIOS) setup.
If you have access to your server's IPMI interface, you can find more details about the reported memory event there.
What make/model of server do you have?
If it is an IBM or Lenovo server, you can collect a lot of hardware data online using the DSA tool.
Memory Correctable ECC events are not considered serious errors, but a count is kept (PFA), and when it exceeds the limit defined by the manufacturer it is recommended to plan a replacement.
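To illustrate the idea of a count checked against a limit, here is a minimal Python sketch. The limit and time window below are hypothetical placeholders (the real values are defined by the server manufacturer and are not given in this thread), and `exceeds_limit` is just an illustrative helper name.

```python
from datetime import datetime, timedelta

# Hypothetical values: the real limit and window are vendor-defined.
LIMIT = 3
WINDOW = timedelta(hours=24)

def exceeds_limit(timestamps, limit=LIMIT, window=WINDOW):
    """Return True if more than `limit` events fall within any span of `window`."""
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        # Move the window start forward until it is within `window` of this event.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False

# The two event times reported in the original post.
events = [
    datetime.fromisoformat("2019-02-28T01:08:16"),
    datetime.fromisoformat("2019-04-25T06:29:14"),
]
print(exceeds_limit(events))
```

With only two events roughly two months apart, this sketch would not flag anything under the assumed limit.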
Hi e_espinel,
Thank you for the response.
I have Dell PowerEdge and HP servers.
Re: "Memory Correctable ECC events are not considered serious errors, but a count is kept (PFA) that when exceeding the limit defined by the manufacturer it is recommended to plan the change."
Yes, that is exactly what I am trying to monitor: how many times a correctable error was reported. To do that I run the command:
esxcli hardware ipmi sel list
Record:390
Record Id: 390
When: 2019-02-28T01:08:16
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + Memory Correctable ECC
Sensor Number: 83
Raw:
Formatted-Raw:
There were more events like this....
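For the counting part, a minimal Python sketch along these lines could tally the correctable ECC events per sensor number. It assumes the listing keeps the `Key: value` layout pasted above (any leading indentation is ignored), and `parse_sel` and `count_correctable` are hypothetical helper names, not part of esxcli. It only counts events; it does not resolve a sensor number to a DIMM slot.

```python
from collections import Counter

def parse_sel(text):
    """Split SEL listing text into one dict per record (assumed 'Key: value' layout)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Record:"):
            # A 'Record:NNN' line starts a new record.
            current = {}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records

def count_correctable(records):
    """Count 'Memory Correctable ECC' events per sensor number."""
    return Counter(
        r.get("Sensor Number")
        for r in records
        if "Memory Correctable ECC" in r.get("Message", "")
    )

# Sample text copied from the output shown in this thread.
sample = """\
Record:390
Record Id: 390
When: 2019-02-28T01:08:16
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + Memory Correctable ECC
Sensor Number: 83
Raw:
Formatted-Raw:
--
Record:393
Record Id: 393
When: 2019-04-25T06:29:14
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + Memory Correctable ECC
Sensor Number: 83
Raw:
Formatted-Raw:
"""

counts = count_correctable(parse_sel(sample))
for sensor, n in counts.items():
    print(f"Sensor {sensor}: {n} correctable ECC event(s)")
```

The text of the full `esxcli hardware ipmi sel list` output could be saved to a file and passed to `parse_sel` in the same way.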
This tells me that a correctable ECC memory event happened at the given date and time, but not which memory module it happened in; it only says Sensor Number: 83. Is there any command or CLI tool that can tell me which memory module this sensor number belongs to? I have multiple DIMMs in my server.
Thank you so much 🙂