Sector read error checks
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
hw-health-charm | Confirmed | Wishlist | Unassigned |
Bug Description
We need checks for sector read errors, for example the pending-sector reports in this smartd output:
Nov 05 13:30:40 dcs1-clp-nod9 smartd[9854]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_
Nov 05 13:30:40 dcs1-clp-nod9 smartd[9854]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 25 to 26
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 24 to 25
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sde [SAT], 1 Currently unreadable (pending) sectors
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdf [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 73
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdf [SAT], SMART Usage Attribute: 195 Hardware_
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 82
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Usage Attribute: 195 Hardware_
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 83
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 195 Hardware_
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 190 Airflow_
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 25 to 26
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], 1 Currently unreadable (pending) sectors
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 73
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Usage Attribute: 195 Hardware_
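As a rough illustration of what such a check could key on, here is a minimal Python sketch that picks the "Currently unreadable (pending) sectors" reports out of smartd syslog lines like the ones above. The function name `pending_sectors` and the regular expression are assumptions made for this example and are not part of hw-health-charm; the sample lines are copied from the log in this report.

```python
import re

# Hypothetical pattern, written against the smartd lines quoted in this bug.
# It captures everything between "Device: " and the first comma as the device.
PENDING_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>[^,]+), "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def pending_sectors(lines):
    """Return {device: count} for every pending-sector report in `lines`.

    If a device is reported more than once, the last report wins.
    """
    found = {}
    for line in lines:
        match = PENDING_RE.search(line)
        if match:
            found[match.group("device")] = int(match.group("count"))
    return found


# Sample lines taken verbatim from the log in this report.
SAMPLE = [
    "Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdc [SAT], "
    "SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84",
    "Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sde [SAT], "
    "1 Currently unreadable (pending) sectors",
    "Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 "
    "[megaraid_disk_06] [SAT], 1 Currently unreadable (pending) sectors",
]

if __name__ == "__main__":
    for device, count in pending_sectors(SAMPLE).items():
        print(f"{device}: {count} pending sector(s)")
```

Only the pending-sector lines match; the Raw_Read_Error_Rate change lines are ignored on purpose, since a single attribute change says little on its own.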
affects: nrpe-charm → hw-health-charm
Changed in charm-hw-health:
importance: Undecided → High
Changed in charm-hw-health:
status: New → Confirmed
assignee: nobody → Peter Sabaini (peter-sabaini)
I've been looking into how to improve disk monitoring a bit. Imho monitoring individual read errors doesn't tell you much about disk health; to make predictions you'd need to monitor whether the number of read errors, or the rate of read errors, crosses some threshold. One option I've looked at is a Nagios plugin from Thomas Krenn[0] which queries smartctl. Unfortunately, driving smartctl with RAIDed disks is a bit vendor-specific, and this plugin would require you to keep a database of drive specifics up to date.
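To make the threshold / rate idea a bit more concrete, here is a minimal sketch of how such a classification might look. It is not how the Thomas Krenn plugin or hw-health-charm works; the function name `classify`, the sample format, and all threshold values are hypothetical and would need tuning per drive model. The return codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(samples, warn_total=1, crit_total=10, crit_per_hour=1.0):
    """Classify a series of error-count samples.

    `samples` is a list of (unix_seconds, cumulative_error_count) tuples,
    oldest first. All thresholds are made-up defaults for illustration.
    """
    if not samples:
        return OK, "no samples"

    first_time, first_count = samples[0]
    last_time, last_count = samples[-1]

    # Growth rate over the observed window; 0 if the window has no duration.
    hours = (last_time - first_time) / 3600.0
    rate = (last_count - first_count) / hours if hours > 0 else 0.0

    if last_count >= crit_total or rate >= crit_per_hour:
        return CRITICAL, f"{last_count} errors, {rate:.2f}/h"
    if last_count >= warn_total:
        return WARNING, f"{last_count} errors, {rate:.2f}/h"
    return OK, f"{last_count} errors, {rate:.2f}/h"


if __name__ == "__main__":
    # One error two hours ago, two now: above warn_total, slow growth.
    code, message = classify([(0, 1), (7200, 2)])
    print(code, message)
```

The point of looking at both the total and the rate is that a single stale error only warns, while a count that keeps climbing escalates even before the absolute threshold is reached.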
Otoh, we already have some support for checking drive health:
a) via vendor-specific tools such as megacli, which report drive state
b) via IPMI for drive faults and, at least on some systems, for predictive failures
At this point I wonder what additional smartctl monitoring would buy us. I'm marking this as wishlist as it's a new feature.
[0] https://github.com/thomas-krenn/check_smart_attributes