hw-health-charm

Sector read error checks

Bug #1851389 reported by David O Neill on 2019-11-05

This bug affects 1 person

Affects          Status     Importance  Assigned to  Milestone
hw-health-charm  Confirmed  Wishlist    Unassigned

Bug Description

We need checks for sector read errors, for example:

Nov 05 13:30:40 dcs1-clp-nod9 smartd[9854]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 75 to 74
Nov 05 13:30:40 dcs1-clp-nod9 smartd[9854]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 25 to 26
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 75
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 24 to 25
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sde [SAT], 1 Currently unreadable (pending) sectors
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdf [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 73
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdf [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 1
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 82
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 1 to 2
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 83
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 1 to 2
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 75 to 74
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 25 to 26
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], 1 Currently unreadable (pending) sectors
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 73
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 1
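As a rough illustration of what such a check might look like, the sketch below scans smartd log lines of the form shown above for "Currently unreadable (pending) sectors" messages and maps the result to a Nagios-style status. It is not part of the charm: the function names and the `warn`/`crit` defaults are hypothetical, and it assumes the log lines keep exactly the format seen in this excerpt.

```python
import re

# Matches smartd syslog lines such as:
#   "... smartd[9854]: Device: /dev/sde [SAT], 1 Currently unreadable (pending) sectors"
PENDING_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>.+?), "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def pending_sectors(lines):
    """Return {device: latest pending-sector count} found in smartd log lines."""
    found = {}
    for line in lines:
        match = PENDING_RE.search(line)
        if match:
            found[match.group("device")] = int(match.group("count"))
    return found


def nagios_status(found, warn=1, crit=10):
    """Map pending-sector counts to a Nagios-style (exit_code, message) pair.

    The warn/crit defaults are placeholders chosen for illustration only.
    """
    worst = max(found.values(), default=0)
    if worst >= crit:
        code, label = 2, "CRITICAL"
    elif worst >= warn:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    detail = ", ".join(f"{dev}: {n}" for dev, n in sorted(found.items()))
    return code, f"{label} - {detail or 'no pending sectors reported'}"
```

Fed the log excerpt above, `pending_sectors` would pick up `/dev/sde [SAT]` and `/dev/bus/0 [megaraid_disk_06] [SAT]`, while the attribute-change lines are ignored.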

Andrea Ieri (aieri) on 2020-01-30

affects:

nrpe-charm → hw-health-charm

Celia Wang (ziyiwang) on 2020-04-02

Changed in charm-hw-health:
importance:	Undecided → High

Peter Sabaini (peter-sabaini) on 2020-04-23

Changed in charm-hw-health:
status:	New → Confirmed
assignee:	nobody → Peter Sabaini (peter-sabaini)

Peter Sabaini (peter-sabaini) wrote on 2020-04-23:

I've been looking into how to improve disk monitoring a bit. IMHO monitoring individual read errors doesn't tell you much about disk health; to make predictions you'd need to monitor whether the number or rate of read errors crosses some threshold. One option I've looked at is a Nagios plugin from Thomas Krenn[0] which queries smartctl. Unfortunately, driving smartctl with RAIDed disks is somewhat vendor-specific, and this plugin would require keeping a database of drive specifics up to date.
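To make the threshold idea concrete, here is a minimal sketch of such an evaluation. It assumes a series of (timestamp, cumulative error count) samples has already been collected by some other means; the function name and both limits are hypothetical placeholders, and meaningful values would depend on the drive model.

```python
from datetime import datetime, timedelta


def evaluate_read_errors(samples, max_total=100, max_per_day=5.0):
    """Classify a drive from (datetime, cumulative_error_count) samples.

    samples must be ordered oldest first. Both limits are illustrative
    placeholders, not vendor-recommended values.
    """
    if not samples:
        return "UNKNOWN"
    first_time, first_count = samples[0]
    last_time, last_count = samples[-1]
    # Absolute threshold: too many errors in total.
    if last_count >= max_total:
        return "CRITICAL"
    # Rate threshold: errors are accumulating too quickly.
    days = (last_time - first_time).total_seconds() / 86400
    if days > 0 and (last_count - first_count) / days > max_per_day:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    start = datetime(2019, 11, 5)
    history = [(start, 0), (start + timedelta(days=1), 20)]
    print(evaluate_read_errors(history))  # 20 errors in one day exceeds 5/day
```

A single new error leaves the result at OK under these limits; only sustained growth or a large total changes the state.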

OTOH, we already have some support for checking drive health:

a) via vendor-specific tools such as megacli, which report drive state

b) via IPMI for drive faults, and also for predictive failures on at least some systems

At this point I wonder what additional smartctl monitoring would buy us. I'm marking this as wishlist as it's a new feature.

[0] https://github.com/thomas-krenn/check_smart_attributes

Changed in charm-hw-health:
importance:	High → Wishlist
assignee:	Peter Sabaini (peter-sabaini) → nobody
