One of my zpools has experienced two successive drive failures. While I was resilvering the first, the second drive failed and I got two errors, in snapshots. The resilvering finished, and I then used "zpool replace" to replace the second faulty drive.
The pool is mounted, all data safe and available except for the two files:
  pool: gggpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
  scan: resilvered 2,35T in 19h29m with 5 errors on Sat Sep 21 03:08:24 2013
config:

        NAME                                             STATE     READ WRITE CKSUM
        gggpool                                          DEGRADED     0     0     5
          raidz1-0                                       DEGRADED     0     0    10
            scsi-SATA_ST3000DM001-9YN_Z1F0NJKS           ONLINE       0     0     0
            scsi-SATA_ST3000DM001-9YN_Z1F0RPKE           ONLINE       0     0     0
            scsi-SATA_ST3000DM001-9YN_Z1F0RPZG           ONLINE       0     0     0
            scsi-SATA_ST3000DM001-9YN_Z1F0RQJ2           ONLINE       0     0     0
            scsi-SATA_ST3000DM001-9YN_Z1F0RQSV           ONLINE       0     0     0
            scsi-SATA_ST3000DM001-9YN_Z1F0T6VN           ONLINE       0     0     0
            spare-6                                      DEGRADED     0     0     0
              scsi-SATA_WDC_WD30EZRX-00_WD-WMC1T4095404  UNAVAIL      0     0     0
              scsi-SATA_ST3000DM001-9YN_Z1F118BA         ONLINE       0     0     0
            replacing-7                                  UNAVAIL      0     0     0
              scsi-SATA_ST3000DM001-1CH_Z1F2Z9VC         UNAVAIL      0     0     0
              scsi-SATA_ST3000DM001-1CH_Z1F2Z8SM         ONLINE       0     0     0
        spares
          scsi-SATA_ST3000DM001-9YN_Z1F118BA             INUSE     currently in use

The remaining errors probably point to where the faulty files were - I destroyed the relevant snapshots, but these error indications remain:

errors: Permanent errors have been detected in the following files:

        <0x218>:<0x7308>
        <0x3a0>:<0x295a6b>

I am not worried about these errors. I am trying to detach the two failed drives, both of which have been replaced, but zpool refuses:
root@ggg:~# zpool detach gggpool scsi-SATA_ST3000DM001-1CH_Z1F2Z9VC
cannot detach scsi-SATA_ST3000DM001-1CH_Z1F2Z9VC: no valid replicas
root@ggg:~# zpool detach gggpool scsi-SATA_WDC_WD30EZRX-00_WD-WMC1T4095404
cannot detach scsi-SATA_WDC_WD30EZRX-00_WD-WMC1T4095404: no valid replicas

The two drives have been physically removed from the array - sent in for warranty replacement - but they live on in the zpool configuration. How do I get rid of them?
When reading data from the pool, I can see the "replacing-7" vdev is not active:
                                                   capacity     operations    bandwidth
pool                                              alloc   free   read  write   read  write
------------------------------------------------  -----  -----  -----  -----  -----  -----
gggpool                                           19,8T  1,96T    323      0  36,8M      0
  raidz1                                          19,8T  1,96T    323      0  36,8M      0
    scsi-SATA_ST3000DM001-9YN_Z1F0NJKS                -      -    177      0  5,42M      0
    scsi-SATA_ST3000DM001-9YN_Z1F0RPKE                -      -    184      0  5,26M      0
    scsi-SATA_ST3000DM001-9YN_Z1F0RPZG                -      -    183      0  5,55M      0
    scsi-SATA_ST3000DM001-9YN_Z1F0RQJ2                -      -    183      0  5,25M      0
    scsi-SATA_ST3000DM001-9YN_Z1F0RQSV                -      -    180      0  5,39M      0
    scsi-SATA_ST3000DM001-9YN_Z1F0T6VN                -      -    181      0  5,21M      0
    spare                                             -      -    298      0  5,47M      0
      scsi-SATA_WDC_WD30EZRX-00_WD-WMC1T4095404       -      -      0      0      0      0
      scsi-SATA_ST3000DM001-9YN_Z1F118BA              -      -    230      0  5,49M      0
    replacing                                         -      -      0      0      0      0
      scsi-SATA_ST3000DM001-1CH_Z1F2Z9VC              -      -      0      0      0      0
      scsi-SATA_ST3000DM001-1CH_Z1F2Z8SM              -      -      0      0      0      0
------------------------------------------------  -----  -----  -----  -----  -----  -----

This is worrying because without this vdev working, the pool has no redundancy - yet I cannot remove or detach either of its two drives. I am in the process of making a full backup - only a day to go. However, destroying this pool and rebuilding it would cause a LOT of headaches, with many filesystems and smb and afs shares having to be set up again.
Any ideas how I can force this failed replacing-7 vdev to work again?
1 Answer
SOLVED
Steps:

- Destroy all the snapshots containing the errors.

Then issue this:

zpool online gggpool [the drive in 'spare' or 'replacing' that says ONLINE but is not really online]

This starts a resilver on every vdev that needs one. Wait for the resilvering to finish; the vdevs will then all show ONLINE instead of DEGRADED.

Finally, detach the stubborn removed disks:

zpool detach gggpool [unavailable drive]

All pools healthy.
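For reference, the whole sequence can be sketched as a dry-run shell script, filling in the pool and device names from the question above. This is only a sketch: every command is prefixed with echo so it prints instead of executing; remove the echo (and run as root) once you have verified the snapshot list yourself.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps. Names are taken from the question;
# nothing here touches the pool until the "echo" guards are removed.
POOL=gggpool
SPARE=scsi-SATA_ST3000DM001-9YN_Z1F118BA
GHOSTS="scsi-SATA_ST3000DM001-1CH_Z1F2Z9VC scsi-SATA_WDC_WD30EZRX-00_WD-WMC1T4095404"

# 1. List the snapshots, then destroy (by hand) the ones holding the errors.
echo zfs list -H -t snapshot -o name -r "$POOL"

# 2. Online the in-use spare that claims ONLINE but is idle; this kicks off
#    the resilver on the stuck vdevs.
echo zpool online "$POOL" "$SPARE"

# 3. Watch the resilver until every vdev reports ONLINE.
echo zpool status -v "$POOL"

# 4. Detach the removed disks that linger in the configuration.
for d in $GHOSTS; do
    echo zpool detach "$POOL" "$d"
done
```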