Is this drive dying?

Red Squirrel · Aug 10, 2012

Noticed all these errors in dmesg:

Code:

ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:ab:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:51:ab:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:ab:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:51:ab:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x5 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:10:00:ab:f3/00:00:73:00:00/40 tag 2 ncq 65536 in
         res 41/40:00:51:ab:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
sd 5:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 5:0:0:0: [sde] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        73 f3 ab 51
sd 5:0:0:0: [sde] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sde, sector 1945348945
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:af:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:52:af:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:af:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:52:af:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:af:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:52:af:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:af:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:52:af:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:18:00:af:f3/00:00:73:00:00/40 tag 3 ncq 65536 in
         res 41/40:00:52:af:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0xb SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/80:00:00:af:f3/00:00:73:00:00/40 tag 0 ncq 65536 in
         res 41/40:00:52:af:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
sd 5:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 5:0:0:0: [sde] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        73 f3 af 52
sd 5:0:0:0: [sde] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sde, sector 1945349970
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/00:08:80:b2:f3/01:00:73:00:00/40 tag 1 ncq 131072 in
         res 41/40:00:53:b3:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/00:08:80:b2:f3/01:00:73:00:00/40 tag 1 ncq 131072 in
         res 41/40:00:53:b3:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x6 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/00:08:80:b2:f3/01:00:73:00:00/40 tag 1 ncq 131072 in
         res 41/40:00:53:b3:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/00:08:80:b2:f3/01:00:73:00:00/40 tag 1 ncq 131072 in
         res 41/40:00:53:b3:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/00:00:80:b2:f3/01:00:73:00:00/40 tag 0 ncq 131072 in
         res 41/40:00:53:b3:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000008
ata6.00: cmd 60/00:10:80:b2:f3/01:00:73:00:00/40 tag 2 ncq 131072 in
         res 41/40:00:53:b3:f3/0e:00:73:00:00/40 Emask 0x409 (media error) <F>
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
sd 5:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 5:0:0:0: [sde] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        73 f3 b3 53
sd 5:0:0:0: [sde] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sde, sector 1945350995
ata6: EH complete
sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
sd 5:0:0:0: [sde] Write Protect is off
sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
__ratelimit: 22 callbacks suppressed
raid5:md0: read error corrected (8 sectors at 1945348864 on sde)
raid5:md0: read error corrected (8 sectors at 1945348872 on sde)
raid5:md0: read error corrected (8 sectors at 1945348880 on sde)
raid5:md0: read error corrected (8 sectors at 1945348888 on sde)
raid5:md0: read error corrected (8 sectors at 1945348896 on sde)
raid5:md0: read error corrected (8 sectors at 1945348904 on sde)
raid5:md0: read error corrected (8 sectors at 1945348912 on sde)
raid5:md0: read error corrected (8 sectors at 1945348920 on sde)
raid5:md0: read error corrected (8 sectors at 1945348928 on sde)
raid5:md0: read error corrected (8 sectors at 1945348936 on sde)
 CIFS VFS: Error connecting to socket. Aborting operation
 CIFS VFS: cifs_mount failed w/return code = -111
 CIFS VFS: Error connecting to socket. Aborting operation
 CIFS VFS: cifs_mount failed w/return code = -111
 CIFS VFS: Error connecting to socket. Aborting operation
 CIFS VFS: cifs_mount failed w/return code = -111

Is it safe to assume that /dev/sde (part of md0) is dying? Or is it some kind of logical error?
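One way to see how localized the problem is: pull the failing sector numbers out of the log. The snippet below is a minimal sketch, assuming the kernel prints I/O errors in the same `end_request: I/O error, dev X, sector N` form as the output above; the function name `failed_sectors` is just an illustration.

```python
import re

# Sample lines copied from the dmesg output above.
SAMPLE = """\
end_request: I/O error, dev sde, sector 1945348945
end_request: I/O error, dev sde, sector 1945349970
end_request: I/O error, dev sde, sector 1945350995
raid5:md0: read error corrected (8 sectors at 1945348864 on sde)
"""

# Matches only the block-layer I/O error lines, not the md "corrected" lines.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_text):
    """Return {device: sorted list of distinct failing sector numbers}."""
    found = {}
    for dev, sector in IO_ERROR.findall(log_text):
        found.setdefault(dev, set()).add(int(sector))
    return {dev: sorted(sectors) for dev, sectors in found.items()}


print(failed_sectors(SAMPLE))
```

In this log all three failing sectors are on sde and sit within a couple of thousand sectors of each other, which points at one bad region of that drive rather than a scattered or array-wide problem.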

Smart data is as follows: (seems to indicate all is ok, but I know smart is not always 100% accurate)

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0027   173   172   021    Pre-fail  Always       -       4308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       43
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4487
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       42
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       27
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       15
194 Temperature_Celsius     0x0022   118   100   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       13

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3439         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Paperlantern · Aug 11, 2012

Just curious what this shows?

Code:

mdadm --detail /dev/md0

Red Squirrel · Aug 11, 2012

Shows all is fine, so it did not fail it yet.

Code:

[root@borg ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Sat Sep 20 02:15:28 2008
     Raid Level : raid5
     Array Size : 4883799680 (4657.55 GiB 5001.01 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Aug 11 22:23:09 2012
          State : clean
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 11f961e7:0e37ba39:2c8a1552:76dd72ee
         Events : 0.1437284

    Number   Major   Minor   RaidDevice State
       0       8       96        0      active sync   /dev/sdg
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf

       6       8      112        -      spare   /dev/sdh
[root@borg ~]#
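As a sanity check on the numbers in that output: with RAID 5, one member's worth of space goes to parity, so the array size should be (Raid Devices - 1) x Used Dev Size. A small sketch of the arithmetic, assuming mdadm reports both sizes in KiB (the function name is made up for illustration):

```python
KIB = 1024  # mdadm sizes are assumed to be in KiB (1024-byte blocks)


def raid5_array_size_kib(raid_devices, used_dev_size_kib):
    """Usable RAID 5 capacity: one member's worth of space holds parity."""
    return (raid_devices - 1) * used_dev_size_kib


# Values taken from the mdadm --detail output above.
array_kib = raid5_array_size_kib(6, 976759936)
print(array_kib)                        # compare with "Array Size"
print(round(array_kib / KIB**2, 2))     # size in GiB
print(round(array_kib * KIB / 1e9, 2))  # size in GB
```

The result agrees with the reported Array Size of 4883799680 (4657.55 GiB, 5001.01 GB), so the array geometry itself looks consistent.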

Paperlantern · Aug 11, 2012

Weird, I've never seen a decimal value in events. Usually it's a whole number, sometimes 5 digits. That's the only anomaly I see; everything else looks good. Have you tried just unmounting/remounting or just a reboot?

Nothinman · Aug 12, 2012

Code:

res 41/40:00:51:ab:f3/0e:00:73:00:00/40 Emask 0x409 (media error)

Yes, media error usually means bad sectors or some other physical problem with the drive.
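The `res` line also tells you which sector the drive gave up on. The sketch below assumes the usual libata layout for that line (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); that field order is my reading of the log format rather than something stated in this thread, and `decode_res_lba` is a hypothetical helper name.

```python
def decode_res_lba(line):
    """Decode the 48-bit LBA from a libata 'res' line (field order assumed)."""
    # Grab the token right after "res", e.g. "41/40:00:51:ab:f3/0e:00:73:00:00/40".
    fields = line.split("res", 1)[1].split()[0]
    _status, low, high, _device = fields.split("/")
    _error, _nsect, lbal, lbam, lbah = low.split(":")
    _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = high.split(":")
    # Most significant byte first: high-order bytes, then the low three bytes.
    return int(hob_lbah + hob_lbam + hob_lbal + lbah + lbam + lbal, 16)


line = "res 41/40:00:51:ab:f3/0e:00:73:00:00/40 Emask 0x409 (media error)"
print(decode_res_lba(line))
```

For the line quoted above this gives 1945348945, the same sector the kernel reports in `end_request: I/O error, dev sde, sector 1945348945`, and the same value as the trailing `73 f3 ab 51` bytes in the sense data.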

MrColin · Aug 12, 2012

Code:

ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }

From experience I recommend you double check your backups and retire this HDD.

Red Squirrel · Aug 12, 2012

Thought so, think I'll go ahead and retire it and force a rebuild on the hot spare. It's less than a year old, so it might actually be under warranty which will be a bonus. I'll just have to do some stress testing on it so I can make it generate smart errors. I don't think they'll accept a warranty otherwise.

Red Squirrel · Aug 13, 2012

I ran a smart test again, this time it failed with a read error. So that confirms it. Forced it out of the raid and let it rebuild with the spare, pulled it out now and opened an RMA. What's odd though is I ran another backup job last night and it never errored out once. Guess it really depends on which part of the drive gets hit, and since it's raid 5 it's possible to never touch that spot even if the data it needs is there, as it can also be rebuilt from the other drives.

I'm also impressed with WD RMA support, they have an option to send the replacement first. Saves me from waiting for shipment both ways (which adds up to a month or so) and also saves me from trying to improvise packaging material or buy some as I can just reuse the same box.

Fallen Kell · Aug 14, 2012

I was just going to say that the smart data you posted earlier already showed it was dying:

1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always

Google wrote a research paper on this. These two values were among the most reliable predictors that a drive is failing or about to fail. Drives that had used their first reallocated sector (i.e. Reallocated_Sector_Ct > 0) were 1400% more likely to die in the next 60 days. A little over 70% of drives which have a read error rate greater than zero die within 6 months.

Crusty · Aug 14, 2012

Fallen Kell said:
I was just going to say that the smart data you posted earlier already showed it was dying:

1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always

Google wrote a research paper on this. These two values were among the most reliable predictors that a drive is failing or about to fail. Drives that had used their first reallocated sector (i.e. Reallocated_Sector_Ct > 0) were 1400% more likely to die in the next 60 days. A little over 70% of drives which have a read error rate greater than zero die within 6 months.

Do you have a link to the paper? I'd love to read it, I've always assumed as long as there were spare sectors to be used still the drive was fine, but empirical evidence from Google would be awesome.

Red Squirrel · Aug 14, 2012

Fallen Kell said:
I was just going to say that the smart data you posted earlier already showed it was dying:

1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always

Google wrote a research paper on this. These two values were among the most reliable predictors that a drive is failing or about to fail. Drives that had used their first reallocated sector (i.e. Reallocated_Sector_Ct > 0) were 1400% more likely to die in the next 60 days. A little over 70% of drives which have a read error rate greater than zero die within 6 months.

I used to think it was bad, but was told at one point to not worry about that. So it really is bad then? I have a couple other of the drives reporting raw_read_error_rate as a number other than 0.

Code:

sdc
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   177   174   021    Pre-fail  Always       -       4108
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   096   000    Old_age   Always       -       2595
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   118   099   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0




sdd:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   174   172   021    Pre-fail  Always       -       4266
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   095   000    Old_age   Always       -       2621
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   118   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1



sdf:

  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   177   175   021    Pre-fail  Always       -       4133
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   097   000    Old_age   Always       -       1732
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   114   104   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       7



sdg


  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   174   173   021    Pre-fail  Always       -       4258
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       42
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   095   000    Old_age   Always       -       3168
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       27
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   115   100   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

So are all these drives bad? What would cause so many drives to be failing? It's actually like the 3rd time I end up having to replace all the drives in this server. Could the controllers be killing them? I was only able to find 2 port sata controllers at the time I built the server so I have a bunch of 2 port as well as using the motherboard ones.

Also could big temp swings be an issue for drives? They tend to mostly run around 30 in summer but in winter they'll run at like 15-20. I don't really condition the temperature in the server area as at this point it's not enclosed, it's basically a rack in the basement.

lxskllr · Aug 14, 2012

Red Squirrel said:
Also could big temp swings be an issue for drives? They tend to mostly run around 30 in summer but in winter they'll run at like 15-20. I don't really condition the temperature in the server area as at this point it's not enclosed, it's basically a rack in the basement.

With most things in life, stable temperature is more important than proper temperature. In your case, it's not like the system's getting shocked with highly variable extremes. It slowly gets colder as winter comes, and slowly gets warmer for summer. I don't know about HDs specifically, but that wouldn't be a concern to me at all.

Nothinman · Aug 14, 2012

Red Squirrel said:
I used to think it was bad, but was told at one point to not worry about that. So it really is bad then? I have a couple other of the drives reporting raw_read_error_rate as a number other than 0.

The raw read error rate is the number of sectors that have had issues with read operations. This alone doesn't mean the drive is dying, but I would watch a drive more closely if it said anything other than zero. Basically it means that many sectors failed to return data or returned corrupted data. Sometimes rewriting the sector will fix it and sometimes the firmware will just give up and remap it to one of the spares.

The reallocated sector count is the number of sectors that have already been marked as bad and remapped to the drive's internal spares. A replacement drive should be found as soon as possible any time this goes above zero.
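If you want to keep an eye on those counters across several drives, something like the following could do it. This is only a sketch: it assumes the ten-column attribute table that `smartctl` printed earlier in this thread, assumes the raw value is a plain integer (true for the rows shown here, not for every drive), and the `WATCHED` set is my own illustrative pick of sector-health attributes, not an official list.

```python
# Attribute IDs to watch; an illustrative choice, adjust to taste.
WATCHED = {1, 5, 196, 197, 198}

# A few rows copied from the sde table posted earlier in the thread.
SDE_SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       13
"""


def parse_attributes(text):
    """Parse attribute rows into {id: {name, value, worst, thresh, raw}}."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(parts) >= 10 and parts[0].isdigit():
            rows[int(parts[0])] = {
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": int(parts[9]),
            }
    return rows


def nonzero_watched(rows):
    """Names of watched attributes whose raw value is above zero."""
    return [r["name"] for i, r in sorted(rows.items()) if i in WATCHED and r["raw"] > 0]


print(nonzero_watched(parse_attributes(SDE_SAMPLE)))
```

On the sde rows above this flags only Raw_Read_Error_Rate; Reallocated_Sector_Ct is still at zero.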

Red Squirrel · Aug 14, 2012

Hmm good to know. I'll definitely have to keep a closer eye on those values then, and think I will consider replacing those other drives as well. Is it normal to get this many or could there be something wrong with my environment/setup? It is quite dusty in the basement and there are a lot of spiders, though the drive I pulled out did not really look all that bad. The front of the server has a filter.

Can a controller actually cause disk issues? I can't see how it could cause sector issues though. What about vibration? They are in a removable disk chassis setup and are close together. I wonder if I almost need to be looking at enterprise drives. I hate to overpay if I don't have to though.

Nothinman · Aug 14, 2012

Red Squirrel said:
Hmm good to know. I'll definitely have to keep a closer eye on those values then, and think I will consider replacing those other drives as well. Is it normal to get this many or could there be something wrong with my environment/setup? It is quite dusty in the basement and there are a lot of spiders, though the drive I pulled out did not really look all that bad. The front of the server has a filter.

Can a controller actually cause disk issues? I can't see how it could cause sector issues though. What about vibration? They are in a removable disk chassis setup and are close together. I wonder if I almost need to be looking at enterprise drives. I hate to overpay if I don't have to though.

Enterprise drives are usually rated with a significantly higher MTBF, but the cost difference is usually more significant. I would guess that heat is the primary cause of early drive failure, but if you have a significant amount of vibration I could see that being problematic as well. Basically I would just suggest you do your best to keep both at a minimum.

lxskllr · Aug 14, 2012

Crusty said:
Do you have a link to the paper? I'd love to read it, I've always assumed as long as there were spare sectors to be used still the drive was fine, but empirical evidence from Google would be awesome.

Nothinman said:
Basically I would just suggest you do your best to keep both at a minimum.

I thought I remembered them saying heat wasn't an issue. That's confirmed by this page...

One of the most intriguing findings is the relationship between drive temperature and drive mortality. The Google team took temperature readings from SMART records every few minutes for the nine-month period. As the figure here shows, failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight. Here's the graph from the paper:

http://storagemojo.com/2007/02/19/googles-disk-failure-experience/

Link to pdf...

http://research.google.com/archive/disk_failures.pdf

earthman · Aug 15, 2012

Smart pass/fail is just a set of parameters that are either under or over predetermined values. You can have a drive on the verge of failure that still passes smart checks because no one value has exceeded its limits. It's not normal for drives to be reallocating many sectors, or to have many read failures, but a few may show up on a drive that has a lot of hours on it. What you have to watch out for is a lot of adjacent or near adjacent sectors failing, or a lot failing quickly. I've got one that's got a few dozen reallocations, but it's only added a couple a month, and it still passes smart, and is not predicted to fail, so I'll keep it in use for now.
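To make the pass/fail point concrete: as I understand the smartctl documentation, an attribute only counts as failed when its normalized VALUE drops to or below THRESH, regardless of what the raw counter says. A minimal sketch using three rows from the sde table earlier in the thread (the helper name is hypothetical):

```python
def smart_attribute_failed(value, thresh):
    """An attribute is treated as failed once VALUE is at or below THRESH."""
    return value <= thresh


# (VALUE, THRESH) pairs copied from the sde attribute table in the first post.
SDE = {
    "Raw_Read_Error_Rate": (200, 51),
    "Spin_Up_Time": (173, 21),
    "Reallocated_Sector_Ct": (200, 140),
}

failing = [name for name, (value, thresh) in SDE.items()
           if smart_attribute_failed(value, thresh)]
print(failing)
```

Nothing is flagged for sde here, even though the kernel was already logging unrecoverable media errors on it, which is exactly the gap described above.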

Nothinman · Aug 15, 2012

lxskllr said:
I thought I remembered them saying heat wasn't an issue. That's confirmed by this page...

http://storagemojo.com/2007/02/19/googles-disk-failure-experience/

Link to pdf...

http://research.google.com/archive/disk_failures.pdf

Interesting, I hadn't read that.

earthman said:
Smart pass/fail is just a set of parameters that are either under or over predetermined values. You can have a drive on the verge of failure that still passes smart checks because no one value has exceeded its limits. It's not normal for drives to be reallocating many sectors, or to have many read failures, but a few may show up on a drive that has a lot of hours on it. What you have to watch out for is a lot of adjacent or near adjacent sectors failing, or a lot failing quickly. I've got one that's got a few dozen reallocations, but it's only added a couple a month, and it still passes smart, and is not predicted to fail, so I'll keep it in use for now.

True, you shouldn't just blindly believe them but I'd rather play it safe and replace any drive that's had more than a dozen or so reallocations. Especially if it's growing at any rate, even by 1 or 2 a week or month.
