There’s a raging debate on the merits of RAID-Z2 over RAID-Z1. Both have their pluses and minuses. RAID-Z2 gives up two disks’ worth of capacity to redundancy, which makes it less attractive for home and small business systems, as these generally have no more than about five disks in a RAID configuration. However, disk capacities keep growing (as I write, 10TB disks are available, though still pricey), and resilvering is taking longer and longer to complete. It can take anywhere from one to two days or more for a resilver to complete on a small system with large capacity disks. RAID-Z1 therefore becomes less attractive as disk capacities grow, because the longer a resilver takes, the greater the risk of a second disk failing, or some other unforeseen event occurring, before it finishes.
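To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. It is only an illustration under assumptions of my own: the function names are made up, the throughput and 3% annual failure rate are placeholder figures rather than measurements, and disks are assumed to fail independently at a constant rate, which real disks of the same age and batch often don't.

```python
import math

HOURS_PER_YEAR = 24 * 365


def resilver_hours(data_tb, rate_mb_s):
    """Hours needed to rewrite data_tb terabytes at a sustained rate_mb_s (MB/s).

    Uses decimal units (1 TB = 1,000,000 MB). data_tb should be the amount of
    data actually stored on the disk, since ZFS resilvers only allocated blocks.
    """
    return data_tb * 1_000_000 / rate_mb_s / 3600


def p_another_failure(surviving_disks, afr, hours):
    """Chance that at least one surviving disk fails within `hours`.

    Assumes independent disks with a constant annual failure rate `afr`
    (a simplification made purely for this illustration).
    """
    hazard_per_hour = -math.log(1 - afr) / HOURS_PER_YEAR
    return 1 - math.exp(-hazard_per_hour * surviving_disks * hours)


# Hypothetical five-disk RAID-Z1: one disk being replaced, four survivors,
# 8 TB of data to rebuild, at three assumed sustained resilver rates.
for rate in (150, 75, 40):
    hours = resilver_hours(8, rate)
    risk = p_another_failure(4, 0.03, hours)
    print(f"{rate:>4} MB/s: {hours:5.1f} h resilver, "
          f"{risk:.3%} chance of another disk failing")
```

The absolute percentages depend entirely on the assumed inputs, so treat them with suspicion. What the sketch does show is the shape of the problem: halve the resilver rate, or double the amount of data per disk, and the window of exposure doubles with it.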
Personally, I think the debaters are missing the point somewhat. Sure, a level of disk redundancy is important in providing continuity of service and, for larger disk sets, higher levels of RAID should be considered. However, we should not lose sight of what’s really important here: first and foremost, avoiding any data loss and, secondly, minimising system downtime so that business continuity is maintained as much as possible.
I’ve actually crippled a ZFS volume during a disk replacement on a RAID-Z1 system. The resilver was progressing well but, because of the size of the disk being replaced, was taking a very long time to complete. While it was running, I experienced an extended power outage that my UPS was unable to cope with. Once power was restored, FreeNAS resumed the resilver but, for whatever reason, became confused and disoriented and began to gradually corrupt the data. I spent about a week under great duress salvaging my data before I could begin rebuilding the pool and bringing the system back online. I didn’t have a complete data backup at the time, as I didn’t have an effective way of automating the backup of 12TB of data.
After this event, I started to think about what was really important in terms of business continuity. It dawned on me that what mattered more in my situation was not additional disk redundancy, but pool redundancy. More specifically, what matters is dataset redundancy, since a pool can be recreated by reassembling its individual datasets.
My next post will look at how I began thinking about achieving dataset redundancy in a RAID-Z1 environment.