Consider a small business that runs predominantly Windows PCs. The business doesn’t have a lot of cash to throw at hardware or software, but still wants to maximise business continuity by minimising system downtime. Two FreeNAS servers are considered as file servers: one active, the other on cold standby (i.e. no automatic switching). Replication has been set up between the two servers, but the business understands that any data created since the last replication event could be lost. To keep the cost down, local authentication rather than directory services is employed. The servers are hardened in the sense that they meet the basic FreeNAS hardware requirements and ZFS RAIDZ2 has been implemented.
For the sake of simplicity, assume identical hardware and pool configurations for both the active server and the backup server. Jails, plugins and VMs are likewise excluded from the study.
Excluding environmental factors (theft, fire, extended loss of power, acts of God, etc.), there are just two worst-case scenarios that take out the active server:
- Failure of the hardware supporting the pool; or,
- Catastrophic pool failure.
Failure of hardware supporting the pool
If it’s evident there is a hardware failure (e.g. a PSU failure), the quickest way to bring services back online is to move the boot device and disks from the failed active server into the backup server, and then bring up the backup server as the new active server. Care is required to ensure that the pool is not inadvertently destroyed through mishandling of the disks and boot devices. The key steps to follow are:
- Shut down the backup server.
- Remove and store the boot device and disks from the backup server.
- If it is not already shut down, shut down the failed active server.
- Remove the boot device and disks from the failed active server and install them in the backup server.
- Boot the backup server. It now becomes the active server. At this point, users are able to access data on the server.
Following hardware repairs on the failed active server:
- Install the boot device and disks that were removed from the backup server and stored into the repaired server.
- Boot the server. What was the failed active server is now the backup server.
- Confirm that data replication is occurring.
The system is unavailable during the initial intervention, and all users are affected. However, downtime should typically be no more than about 15 minutes. There is no loss of server data.
Catastrophic pool failure
The first step is to determine the severity of the pool failure. During the investigation, users can be directed to the backup server, where they can access their data in read-only mode. Virtual namespace services such as DFS would make this step more transparent. If the failed pool can be repaired with minimal or no data loss, users are directed back to the active server once the pool is restored.
If it’s evident there is a catastrophic failure of the active server pool, it will then be necessary to switch over to use the backup server pool. The key steps to follow are:
- Shut down both servers.
- Remove and store the backup server boot device. Replace it with the boot device from the failed active server.
- Boot the backup server.
- Import the pool.
- Change pool state so that it can be written to (zfs set readonly=off). What was the backup server is now the active server. At this point, users are able to access data on the server.
- Set up periodic snapshots.
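The import and read-write steps above can be sketched as a short command list. The following is a minimal sketch only: the helper name `promotion_commands` and the pool name `tank` are hypothetical, and nothing is executed. It assumes the `readonly` property was set on the pool's root dataset. FreeNAS generally expects pools to be imported through its web interface, so the `zpool import` line is shown purely to illustrate the order of operations.

```python
import shlex


def promotion_commands(pool: str) -> list[str]:
    """Build (but do not run) the commands that make a replicated pool writable.

    `pool` is the name of the pool on the backup server. The list is meant
    to be reviewed by an operator before anything is typed into a shell.
    """
    if not pool or any(ch.isspace() for ch in pool):
        raise ValueError("pool name must be non-empty and contain no whitespace")
    quoted = shlex.quote(pool)
    return [
        f"zpool import {quoted}",                  # import the replicated pool
        f"zfs set readonly=off {quoted}",          # allow writes again
        f"zfs get -H -o value readonly {quoted}",  # verify: should report "off"
    ]


if __name__ == "__main__":
    # "tank" is a placeholder pool name, not one taken from the scenario.
    for command in promotion_commands("tank"):
        print(command)
```

Keeping the sequence as data rather than running it directly suits a cold-standby setup, where each step is carried out and checked by hand.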
Attention now turns to the failed active server.
- Install the stored backup server boot device in what was the failed active server.
- Boot the server. What was the active server is now the backup server.
- Recreate the pool.
- Set up replication with the active server.
All users are affected. Users are directed to the backup server, where their data is available, but in read-only mode and only up to the last replication event. If the failed pool is recovered, users are redirected back to the active server, where any offline changes they made to data during the intervening period may need to be merged back.
If it is ascertained that there is a catastrophic failure of the active server pool, the system becomes unavailable while the switch to the backup server pool is underway. Downtime while this occurs should be no more than about 15 to 30 minutes. Any data created since the last replication event will be lost.
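To put a rough number on "data created since the last replication event", the sketch below estimates the worst-case loss window. It assumes, as a simplification not stated in the scenario, that replication starts as soon as each periodic snapshot is taken and takes a fixed time to complete. Under that assumption, a failure just before a replication run finishes loses everything since the previous snapshot, which is one snapshot interval plus the replication time. The function name and the sample figures are hypothetical.

```python
def worst_case_loss_minutes(snapshot_interval: float, replication_duration: float) -> float:
    """Upper bound, in minutes, on the work lost after a catastrophic pool failure.

    Assumes replication begins immediately after each snapshot and that the
    failure strikes just before the newest snapshot finishes replicating.
    """
    if snapshot_interval <= 0:
        raise ValueError("snapshot_interval must be positive")
    if replication_duration < 0:
        raise ValueError("replication_duration cannot be negative")
    return snapshot_interval + replication_duration


if __name__ == "__main__":
    # Illustrative intervals only; replication is assumed to take 5 minutes.
    for interval in (15, 60, 240):
        window = worst_case_loss_minutes(interval, 5)
        print(f"snapshot every {interval} min: up to {window} min of work at risk")
```

If replication is restricted to certain hours, the window outside those hours grows to the time since the last replicated snapshot, which this simple model does not capture.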
Catastrophic pool failure is trickier to deal with than failure of the hardware supporting the pool. To minimise data loss and disruption to services involving catastrophic pool failure, several important tasks need to be undertaken/reviewed during normal system operation.
- If directory services are not employed, users and groups need to be manually created on the backup server with matching UIDs and GIDs.
- Share points on the backup server need to be created in advance.
- Check that the data on the backup server is read-only (zfs set readonly=on).
- To minimise data loss, carefully review the frequency of periodic snapshots and ensure replication is occurring during periods of user activity.
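The UID/GID requirement in the list above is easy to get wrong when accounts are created by hand, so a periodic comparison may help. The sketch below assumes an account listing in the standard colon-separated `passwd` layout (name, password, UID, GID, ...) has been saved from each server by whatever means is convenient. The function names and the sample accounts are hypothetical.

```python
def parse_passwd(text: str) -> dict[str, tuple[int, int]]:
    """Map each account name to its (uid, gid) from passwd-format text."""
    accounts: dict[str, tuple[int, int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a passwd-style record
        accounts[fields[0]] = (int(fields[2]), int(fields[3]))
    return accounts


def find_mismatches(active: dict[str, tuple[int, int]],
                    backup: dict[str, tuple[int, int]]) -> list[str]:
    """List accounts on the active server that are absent or differ on the backup."""
    problems = []
    for name, ids in sorted(active.items()):
        if name not in backup:
            problems.append(f"{name}: missing on backup")
        elif backup[name] != ids:
            problems.append(f"{name}: active uid/gid {ids} != backup {backup[name]}")
    return problems


if __name__ == "__main__":
    # Made-up sample listings; real input would come from each server.
    active_listing = "alice:*:1001:1001:Alice:/nonexistent:/usr/sbin/nologin\n"
    backup_listing = "alice:*:1002:1001:Alice:/nonexistent:/usr/sbin/nologin\n"
    for problem in find_mismatches(parse_passwd(active_listing),
                                   parse_passwd(backup_listing)):
        print(problem)
```

The same comparison can be applied to group listings by adapting the field positions.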
A surprising finding from working through this small business scenario is that when the active server fails, whether through its hardware or its pool, the boot device of the failed active server is moved across to the backup server. Switching the software state of the backup server’s boot device from backup to active is not the preferred default action. Doing so, it appears, complicates returning to the status quo.