July 2013 Archives

dao had a hung disk. A hard reboot was required.

It's unclear which disk was the problem. Will look into it after sleep.


[Mon Jul 29 15:20:55 2013]ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/0f:00:d9:09:39/00:00:33:00:00/40 tag 0 ncq 7680 in
         res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
Jul 29 07:32:59 dao kernel: ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
[Mon Jul 29 15:20:57 2013]ata1.00: exception Emask 0x0 SAct 0x1fe SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/0f:40:d9:09:39/00:00:33:00:00/40 tag 8 ncq 7680 in
         res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x0
[Mon Jul 29 15:21:00 2013]ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/0f:00:d9:09:39/00:00:33:00:00/40 tag 0 ncq 7680 in
         res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008



so it's sda.
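For longer logs than the excerpt above, a rough way to see which port is throwing the errors is to tally the `error: { ... }` lines per ATA port. This is only a sketch: `count_port_errors` is a made-up helper name, and it only counts per port (ata1.00 and so on). Tying the port back to sda still relies on the "SCSI device sda" lines showing up next to the errors, as they do here.

```python
import re
from collections import Counter

# Matches lines such as "ata1.00: error: { UNC }" and captures the
# port name and the flag list between the braces.
PORT_ERROR = re.compile(r"(ata\d+\.\d+): error: \{ ([A-Z ]+?) \}")


def count_port_errors(log_text):
    """Count error flags (e.g. UNC) per ATA port in a kernel log excerpt."""
    counts = Counter()
    for port, flags in PORT_ERROR.findall(log_text):
        for flag in flags.split():
            counts[(port, flag)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ata1.00: status: { DRDY ERR }\n"
        "ata1.00: error: { UNC }\n"
        "ata1.00: status: { DRDY ERR }\n"
        "ata1.00: error: { UNC }\n"
    )
    for (port, flag), n in count_port_errors(sample).items():
        print(port, flag, n)
```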

SMART confirms it.

 

it hung again.  killing sda.

# 1  Short offline       Completed: read failure       10%     13900         859376100
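The LBA in that self-test line can be cross-checked against the `res` lines in the kernel log. A minimal sketch, assuming the result taskfile is printed in the usual libata order (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that the drive is using 48-bit LBA; `lba_from_res` is a hypothetical helper, not part of any tool.

```python
import re


def lba_from_res(line):
    """Decode the 48-bit LBA from a libata 'res ...' result line.

    Assumed field order (12 hex bytes):
    status/error:nsect:lbal:lbam:lbah/
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    match = re.search(r"res\s+([0-9a-f/:]+)", line)
    if match is None:
        raise ValueError("no 'res' taskfile found in line")
    fields = [int(f, 16) for f in re.split(r"[/:]", match.group(1))]
    if len(fields) != 12:
        raise ValueError("expected 12 taskfile bytes, got %d" % len(fields))
    lbal, lbam, lbah = fields[3], fields[4], fields[5]
    hob_lbal, hob_lbam, hob_lbah = fields[8], fields[9], fields[10]
    # Low three bytes come from the current registers, high three from
    # the "hob" (high order byte) registers.
    return (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )


if __name__ == "__main__":
    line = "res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>"
    print(lba_from_res(line))
```

Fed the `res` line from the ata1.00 errors, this gives 859376100, the same LBA the short self-test reports as its first error, so the kernel and the drive agree on where the bad sector is.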



so yeah, uh,

16:43 <+nb> INFO: task blkback.16.xvda:12340 blocked for more than 120 seconds.
16:44 <+nb> ata3.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
16:44 <+nb> ata3.00: irq_stat 0x40000008
16:44 <+nb> ata3.00: cmd 60/e8:08:e9:91:e5/00:00:08:00:00/40 tag 1 ncq 118784 in
16:44 <+nb>          res 41/40:00:04:92:e5/00:00:08:00:00/40 Emask 0x409 (media
            error) <F>
16:44 <+nb> ata3.00: status: { DRDY ERR }
16:44 <+nb> [Mon Jul 22 16:49:59 2013]ata3.00: error: { UNC }
16:44 <+nb> Jul 22 09:01:51 black kernel: ata3.00: exception Emask 0x0 SAct 0x7
            SErr 0x0 action 0x0
16:44 <+nb> SCSI device sdc: 3907029168 512-byte hdwr sectors (2000399 MB)
16:44 <+nb> sdc: Write Protect is off
16:44 <+nb> SCSI device sdc: drive cache: write back


16:51 < prgmrcom> nb
16:51 < prgmrcom> oh no
16:52 < prgmrcom> gonna reboot it
16:53 < prgmrcom> fuuuck.  and I paid for the expensive disks that aren't
                  supposed to do that.  I'm pissed.




But yeah, the upshot here is that one of our disks went bad... in a way that a disk half as expensive would be expected to go bad. Not a good morning.



SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12687         -
# 2  Short offline       Completed without error       00%     12663         -
# 3  Short offline       Completed without error       00%     12641         -
# 4  Conveyance offline  Completed: read failure       10%     12618         188061143
# 5  Short offline       Aborted by host               10%     12617         -
# 6  Short offline       Completed without error       00%     12595         -
# 7  Short offline       Completed: read failure       10%     12570         188061143
# 8  Short offline       Completed without error       00%     12545         -
# 9  Short offline       Completed without error       00%     12523         -
#10  Short offline       Completed without error       00%     12499         -
#11  Extended offline    Completed without error       00%     12487         -
#12  Short offline       Completed without error       00%     12475         -
#13  Short offline       Completed without error       00%     12457         -
#14  Short offline       Completed without error       00%     12357         -
#15  Short offline       Completed without error       00%     12334         -
#16  Extended offline    Completed without error       00%     12320         -
#17  Short offline       Completed without error       00%     12310         -
#18  Short offline       Completed without error       00%     12287         -
#19  Short offline       Completed without error       00%     12264         -
#20  Short offline       Completed without error       00%     12241         -
#21  Short offline       Completed without error       00%     12221         -
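Read failures are easy to miss in a log like this once newer passing tests push them down the list. A small sketch that pulls the failed entries out of self-test log text in the layout shown above; `read_failures` is a hypothetical helper, and the column layout (two or more spaces between the description and status columns) is assumed from this output rather than from any spec.

```python
import re

# One row of the self-test log: number, description, status,
# remaining %, lifetime hours, LBA of first error (or "-").
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def read_failures(selftest_text):
    """Return (test number, lifetime hours, LBA) for each read failure."""
    failures = []
    for line in selftest_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header lines, blank lines, anything unexpected
        num, _desc, status, _remaining, hours, lba = match.groups()
        if "read failure" in status and lba != "-":
            failures.append((int(num), int(hours), int(lba)))
    return failures


if __name__ == "__main__":
    sample = (
        "# 3  Short offline       Completed without error       00%     12641         -\n"
        "# 4  Conveyance offline  Completed: read failure       10%     12618         188061143\n"
    )
    for num, hours, lba in read_failures(sample):
        print("test #%d at %d hours failed at LBA %d" % (num, hours, lba))
```

Run against the log above, it should flag tests #4 and #7, both at LBA 188061143, which is the sort of thing worth alerting on rather than trusting memory.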



So yeah, I thought I remembered SMART errors on sdc (which was the problem in this case). My plan was to leave the drive in until I bought a replacement, which was clearly a mistake. Yanking the drive and heading to central right now.

About this Archive

This page is an archive of entries from July 2013 listed from newest to oldest.
