- Apr 20, 2013
- 4,307
- 450
- 126
Well, I've had my first bad(ish) experience with ZFS, and I'm a bit perplexed as to why it went bad. First off, let me say I'm operating under the assumption that it's my fault. I also have the box backed up, so worst case I could have blown it away and rebuilt it, but I'm trying to understand what went wrong so I don't repeat it in the future.
Setup:
Solaris 11 + napp-it backing two ESXi 6.0 hosts
Crucial 128GB MX100s for boot drive and ZIL
16x Seagate SATA spinners for pool in RAIDZ-2
All drives running off an IBM M1015 (LSI controller) in IT mode
Recently I've been noticing severely decreased write performance on the VMs backed by the pool. Today it reached the point where I decided to dig into it. vSphere shows command-queuing latency, meaning it's sending commands to the storage and taking too long to get a reply. I try copying a 30GB file to the file server VM over the network. It initially runs at ~115MB/s as expected, but after about 15GB, throughput drops to nothing. A dd benchmark on the ZFS pool shows write speed severely degraded compared to normal. Disk stats show the ZIL at 70% busy with a wait queue, despite almost nothing going on with the storage. Turning sync off brings speed back to normal. So, the ZIL drive is toast. This isn't a production environment, so I didn't go with a "good" SSD, but that said, the MX100 wasn't terribly old either.
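For reference, the checks described above look roughly like this on a Solaris box. This is a sketch, not the exact commands from this incident; the pool name "tank" and dataset "tank/vmstore" are placeholders.

```shell
# Rough sequential-write benchmark against the pool (placeholder paths)
dd if=/dev/zero of=/tank/vmstore/ddtest bs=1024k count=10240

# Per-device stats: a slog sitting at ~70 %b with a wait queue while
# the data disks are idle points at the log device
iostat -xn 5

# Same picture from the ZFS side, broken out per vdev
zpool iostat -v tank 5

# Temporarily disable sync to confirm the slog is the bottleneck
zfs set sync=disabled tank/vmstore
# ...retest, then restore the safe default
zfs set sync=standard tank/vmstore
```

Leaving sync=disabled on VM storage risks data loss on power failure, so it is only useful here as a diagnostic toggle.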
So I shut down the SAN, pull the ZIL, and hook it up to my primary PC in a USB enclosure. Crucial Toolbox says the drive is healthy (9TB written), but it's running old firmware, so I upgrade the firmware. Since it's been running as a ZIL, I know GC probably isn't working, so I do a secure erase to see if I can restore its previous speeds. This is probably my critical error, but I'm trying to understand why it caused as many problems as it did.
Put the wiped ZIL drive back in its original bay and power the SAN back on. Log into napp-it and the box is running slow as balls. Pages that normally load nearly instantly take 5+ minutes. The pool shows as UNAVAILABLE. I export the pool so I can re-import it with the -m flag. The box hangs on the import. Reboot the box, and it hangs on boot (stuck at the spinning wheel). At this point I'm assuming I'll be rebuilding the SAN and restoring from backup. Hard reset the box and it boots normally back into Solaris. The zpool is missing, but the box is otherwise running normally and response times seem fine. Curious, but I press on.
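For anyone following along, the export/import dance is roughly this (pool name "tank" is a placeholder):

```shell
# Export the pool so it can be re-imported cleanly
zpool export tank

# -m allows the import to proceed with a missing (or, here, wiped)
# log device instead of refusing outright
zpool import -m tank

# Check what state the pool and its vdevs came back in
zpool status -v tank
```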
Re-import the pool with -m. It now shows up as degraded (which is expected). Clear the pool, and it now lets me remove and re-add the ZIL. The pool is back online, which is great. However, the ESXi hosts this storage is backing aren't seeing the LUs. I check the Solaris box and the LUs are gone, but the backing files still exist on disk. Import the LUs, recreate the views, and the hosts see the storage again. Power on the VMs and everything runs normally, except write speed is still hosed. Disable sync again and performance returns to full speed. So the ZIL is still hosed.
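The recovery steps above sketch out something like the following with COMSTAR's stmfadm. Device names, the LU backing-file path, and the host group are all placeholders, not the actual ones from this box:

```shell
# Clear pool errors, then swap out the stale log device
zpool clear tank
zpool remove tank c4t1d0        # detach the old slog
zpool add tank log c4t1d0       # re-add it as a log vdev

# Re-import a logical unit from its surviving backing file;
# import-lu prints the LU's GUID on success
stmfadm import-lu /tank/vmstore/esx-lu01

# Recreate the view so the ESXi host group can see the LU again
# (use the GUID printed by import-lu)
stmfadm add-view -h esx-hosts 600144F0...

# Verify the LUs and their views
stmfadm list-lu -v
```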
So, this leaves me with two questions.
1) Why did the pool issues have such a severe negative impact on OS performance? The OS isn't running off the pool.
2) Would offlining the pool, then removing the ZIL and adding its replacement, have prevented this issue?
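For question 2, the sequence I have in mind would be something like this (pool and device names are placeholders); the idea being that the pool never sees a device whose contents changed underneath it:

```shell
# Remove the log device from the pool BEFORE pulling/wiping it;
# slog removal is supported on a live pool
zpool remove tank c4t1d0

# ...physically swap or secure-erase the drive here...

# Add the replacement (or the wiped original) back as a slog
zpool add tank log c4t2d0
```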