ZFS Quick


This page is a quick reference for typical operational procedures when running FreeBSD on ZFS.


Fixing problems

Inspection commands

  • zpool status
  • top (look at ARC)
  • iftop (pkg install iftop)
  • gstat
  • iostat -xz 1
  • vmstat 1
  • systat -ifstat
  • systat -vmstat
  • zpool iostat 1
  • gpart show [-l]
  • trafshow (pkg install trafshow)
  • procstat
  • Show block alignment: zdb -C | fgrep ashift
  • See also http://www.brendangregg.com/blog/2015-03-06/performance-analysis-bsd.html
  • "camcontrol identify ada0" can help identify disks. Camcontrol has lots of other commands too; read its man page.
  • smartctl (pkg install smartmontools) can inspect SMART data and run the disk's SMART tests; for example, "smartctl -a /dev/ada0" will list all SMART info. A self-test example follows this list.
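For example, to kick off a disk's long self-test and check the result afterward (ada0 is a placeholder for your device):
smartctl -t long /dev/ada0
smartctl -l selftest /dev/ada0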

Boot problems

If boot fails, boot from installation media and try the following:

  • Check that the zpool's bootfs property is set correctly. The command for this example is "zpool set bootfs=tank/. tank".
  • Reload gptzfsboot (not zfsboot or another) in the freebsd-boot partition on each disk. The command is "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i ? ada0" replacing ? with the partition number of the freebsd-boot partition (not the ZFS partition, or you'll toast it!).
  • Check /boot/loader.conf specifies zfs_load="YES". If the kernel boots but can't find the root filesystem, this is likely missing.
  • Check /etc/rc.conf specifies zfs_enable="YES". If the system boots but doesn't mount any filesystems besides root, this is likely missing.
  • DO NOT set recordsize > 128 KB on the filesystem containing /boot. (At least on FreeBSD 11.0 or before.) It will seem to work at first, until some day after you run freebsd-update and new boot files are written with big blocks. Then gptzfsboot will fail with the error "blocks larger than 128k are not supported". To fix: boot from installation media, import the zpool, use "zfs set recordsize=128K tank/." to stop the effects on new files, and then rewrite the startup files with: "cp -PRp /boot/ /boot-new/ && mv -i /boot /boot-broken && mv -i /boot-new /boot && rm -rf /boot-broken". It may be surprising that rewriting these files gets the boot loader working reliably again even while larger blocks remain outside of /boot, but it worked experimentally at least a few times.
  • Try rebuilding zpool.cache. FreeBSD 10 no longer requires zpool.cache to boot from zfs. However, this file is where the kernel remembers the machine's unique hostid, which zpools to import, and their devices. If someone fiddled with things, it could be goofed up. The easiest approach is to remove /boot/zfs/zpool.cache and reboot. This will automatically create a new zpool.cache with the root zpool imported, after which you'll need to import any additional zpools. Naturally, you'll need to prevent any services that depend on the additional zpools from starting until they're imported. On FreeBSD 9, or if that doesn't work, try the manual method: boot from installation media and use something like "zpool import -f -o altroot=/mnt -o cachefile=/tmp/zpool.cache tank && cp /tmp/zpool.cache /mnt/boot/zfs/".

Replacing a disk

  • Prepare the replacement disk with gpart as shown below.
  • Replace the failed disk using "zpool replace", as sketched below.
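For example, assuming the failed member was gpt/tank.zfs0 and the replacement disk shows up as ada2 (hypothetical device and label names; the partitioning follows the recipe under "Creating a zpool"):
# gpart create -s gpt -n 152 ada2
# gpart add -i 1 -t freebsd-boot -l tank.boot2 -s 1090 ada2
# gpart add -i 2 -t freebsd-swap -l tank.swap2 -b 524288 -s 64G ada2
# gpart add -i 3 -t freebsd-zfs -l tank.zfs2 ada2
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2
# zpool replace tank /dev/gpt/tank.zfs0 /dev/gpt/tank.zfs2
Watch "zpool status" for resilver progress.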

Adding a vdev

Adding a vdev is easy enough. You should consider maintaining 4 KB blocks, in case you later replace a disk built on 512-byte sectors with a disk built on 4 KB sectors. This operation cannot be reversed, so be sure you type it in right the first time; be especially careful to place the "mirror" or "raidz" keyword correctly.

# sysctl vfs.zfs.min_auto_ashift=12
# zpool add tank mirror /dev/gpt/tank.zfs0 /dev/gpt/tank.zfs1
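Since the add cannot be undone, it's worth previewing the layout first with -n (dry run), which prints the configuration that would result without modifying the pool; tank.zfs2 and tank.zfs3 are hypothetical labels for the new disks:
# zpool add -n tank mirror /dev/gpt/tank.zfs2 /dev/gpt/tank.zfs3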

Recovering from other issues

  • To access a zpool from the FreeBSD live CD, use the command below. If you have a failing disk, "-o readonly=on" prevents the system from automatically trying to resilver or fix problems, which improves your odds of copying data off a failing magnetic disk before it becomes completely inaccessible.
zpool import -f -o altroot=/mnt [-o readonly=on] tank
  • To rename a zpool. You may need to do this if you put the zpool in another machine for recovery and it has the same name as that machine's root pool. The rename may appear temporary, but it is permanent. "guid" can be the old name of the pool if that name is unique.
zpool import  # To list importable pools.
zpool import [-o altroot=/mnt] guid newname
  • To reguid a pool, as shown below. You may need to do this if you cloned a pool by splitting mirrors or by cloning disks with dd. The symptom is that if two zpools with the same GUID are present in the same system, zpool import won't show the second pool's disks, because it thinks they're part of the pool that's already imported.
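A minimal sketch, assuming the copy you want to keep is the one currently imported as tank; give it a new GUID, after which the other copy becomes importable (rename it as described above if the names collide):
# zpool reguid tank
# zpool import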

Creating a zpool

  • You may need to remove a software RAID formerly configured by the hardware or BIOS, because FreeBSD recognizes these formats and locks the underlying devices, resulting in a rather mysterious "Operation not permitted" error. The commands you're looking for (which will destroy data) are:
# graid list -a
Geom name: Intel-4829475d
...
# graid delete Intel-4829475d
  • Prepare each disk's partitions and boot loader. Notes:
    • Nothing stops you from configuring swap on a ZFS volume (zvol), but doing so leads to deadlocks.
    • If you don't put any swap space at the beginning, then leave a little unused space at the beginning of the disk. It can be handy if boot code grows, etc. A generous 256 MB gap is only about 0.25% of a 100 GB disk, so don't be stingy.
    • Keep the ZFS partitions a bit smaller than your disk in case you RMA a disk and get back one slightly smaller. It happens. Use the slack space for swap.
    • Put swap before zfs partitions. For example, if you later need to create a special /boot partition to work around a cranky BIOS that can't read beyond 2 TB, you will find it handy to have swap there to steal space from. Also it makes swap space slightly faster.
    • Name partitions uniquely or you'll have trouble referencing them by name.
    • Creating the GPT with -n 152 pads the table so the first partition starts at a 4k boundary, regardless of whether the sector size is 512 bytes, 4k, or 4k reporting 512.
    • According to https://wiki.freebsd.org/ZFSTuningGuide , "The caveat about only giving ZFS full devices is a solarism that doesn't apply to FreeBSD. On Solaris write caches are disabled on drives if partitions are handed to ZFS. On FreeBSD this isn't the case."
# gpart create -s gpt -n 152 ada0
# gpart add -i 1 -t freebsd-boot -l tank.boot0 -s 1090 ada0
# gpart add -i 2 -t freebsd-swap -l tank.swap0 -b 524288 -s 64G ada0
# gpart add -i 3 -t freebsd-zfs -l tank.zfs0 ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

# gpart create -s gpt -n 152 ada1
# gpart add -i 1 -t freebsd-boot -l tank.boot1 -s 1090 ada1
# gpart add -i 2 -t freebsd-swap -l tank.swap1 -b 524288 -s 64G ada1
# gpart add -i 3 -t freebsd-zfs -l tank.zfs1 ada1
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
  • zpool create
    • At this point, you should consider forcing the 4 KB block format rather than relying on the reported sector size of your disks, which may be 4 KB or 512 bytes. Using larger blocks than the disk's native sector size is fine, but the reverse incurs a nasty performance penalty. If you later replace (possibly to expand) a disk that uses 512-byte sectors with a newer disk that uses 4 KB sectors, the zpool's performance will suffer. Worse, once a vdev is added to a zpool, either with "zpool create" or "zpool add", its block size cannot be changed and the vdev cannot be removed; the only recourse is to create a replacement zpool and copy the data to it. To force 4 KB blocks, use vfs.zfs.min_auto_ashift=12, shown below. This sysctl only affects the "zpool create" and "zpool add" commands, not existing vdevs or zpools. Use zdb to examine existing pools.
    • For example, to create a simple mirror:
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create -o altroot=/mnt -o autoexpand=on -O mountpoint=/ -O canmount=off -O atime=off -O compression=lz4 -O recordsize=1M -O redundant_metadata=most -O com.sun:auto-snapshot=true [-O aclmode=passthrough] [-O aclinherit=passthrough-x] [-O dedup=on] tank mirror /dev/gpt/tank.zfs0 /dev/gpt/tank.zfs1

or single disk:

# sysctl vfs.zfs.min_auto_ashift=12
# zpool create -o altroot=/mnt -o autoexpand=on -O mountpoint=/ -O canmount=off -O atime=off -O compression=lz4 -O recordsize=1M -O redundant_metadata=most -O com.sun:auto-snapshot=true [-O aclmode=passthrough] [-O aclinherit=passthrough-x] [-O dedup=on] tank /dev/gpt/tank.zfs0

For raidz, multiple mirrors, and other layouts, see man zpool.
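
For instance, a hypothetical raidz2 of four disks, with the dataset options trimmed to a few from the mirror example above:

# sysctl vfs.zfs.min_auto_ashift=12
# zpool create -o altroot=/mnt -o autoexpand=on -O mountpoint=/ -O canmount=off -O atime=off -O compression=lz4 tank raidz2 /dev/gpt/tank.zfs0 /dev/gpt/tank.zfs1 /dev/gpt/tank.zfs2 /dev/gpt/tank.zfs3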

Installing/Restoring FreeBSD on ZFS

  • First arrange your root and descendent filesystems. See the notes on filesystem hierarchies below.
# zfs create -o refquota=8G -o refreservation=8G -o recordsize=128K tank/.
# zfs create -o refquota=64G -o refreservation=64G tank/mysql
# zfs create -o refquota=64G -o refreservation=64G tank/pgsql
# zfs create tank/home
  • Now that these are (automatically) mounted, load the data. If you're restoring from a backup or cloning another machine, now would be the time to populate the files. Use "tar|nc" or see below for "zfs send|zfs receive". To do a fresh installation, extract the distfiles from the live CD as follows:
[for FreeBSD 10]
# tar -C /mnt -xpf /usr/freebsd-dist/base.txz
# tar -C /mnt -xpf /usr/freebsd-dist/doc.txz
# tar -C /mnt -xpf /usr/freebsd-dist/games.txz
# tar -C /mnt -xpf /usr/freebsd-dist/kernel.txz
# tar -C /mnt -xpf /usr/freebsd-dist/lib32.txz
[Add ports.txz and/or src.txz if needed.]

[for FreeBSD 11]
# tar -C /mnt -xpf /usr/freebsd-dist/base.txz
# tar -C /mnt -xpf /usr/freebsd-dist/doc.txz
# tar -C /mnt -xpf /usr/freebsd-dist/kernel.txz
# tar -C /mnt -xpf /usr/freebsd-dist/lib32.txz
[Add ports.txz src.txz tests.txz if needed.]
  • Check/fix permissions on each mountpoint, especially /tmp, /var/tmp, etc.
# chmod 1777 /mnt/tmp /mnt/var/tmp /mnt/var/tmp/vi.recover
  • Prepare boot config:
# cat >> /mnt/boot/loader.conf << EOF
zfs_load="YES"
#vfs.zfs.arc_max=64M  # Enable on small-memory or 32-bit systems.
#vfs.zfs.vdev.cache.size=8M  # Enable on small-memory or 32-bit systems.
EOF
# cat >> /mnt/etc/rc.conf << EOF
zfs_enable="YES"
EOF
  • If you use swap space, add it to /etc/fstab now. Almost certainly one should encrypt swap by adding .eli to the device names as follows:
# cat >> /mnt/etc/fstab << EOF
/dev/gpt/tank.swap0.eli none swap sw 0 0
/dev/gpt/tank.swap1.eli none swap sw 0 0
EOF
  • Make zpool bootable.
# zpool set bootfs=tank/. tank
  • Reboot.

Send/Receive

To send the named snapshot together with all preceding snapshots, properties, and descendent filesystems (a full replication stream):

zfs send -R sourcetank@sent | zfs receive -Fv desttank

To send only the specified snapshot (with its properties):

zfs send -p sourcetank@sent | zfs receive -Fv desttank

Send across a network with nc:

zfs send -R sourcetank@sent | nc -N -l 12345           # run on the source machine
nc -v othermachine 12345 | zfs receive -Fv desttank    # run on the destination; "othermachine" is the source
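
For periodic replication, an incremental stream sends only the changes between two snapshots; a sketch, assuming @prev was already received on the destination and @sent is the newer snapshot:

zfs send -R -i @prev sourcetank@sent | zfs receive -Fv desttank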

Notes on filesystem hierarchies

Every zpool begins with one filesystem which has the same name as the zpool and cannot be deleted or renamed or replaced. The filesystems in the zpool are named in a hierarchy starting with this first filesystem. This hierarchy is not the same as the hierarchy of mountpoints, but is related. The mountpoint for the special first filesystem defaults to the name of it (under /). The default mountpoint of descendent filesystems is the mountpoint of the parent, with the name of the child appended (slash separated). Explicitly changing the mountpoint of a parent will change the default mountpoints of its children. Explicitly changing the name of a parent which is using its default mountpoint will change the parent's mountpoint along with the default mountpoints of its children. Most properties are inherited, and inheritance always follows the filesystem hierarchy, not the mountpoint hierarchy; after all, mountpoints are merely properties to be inherited (with the special treatment of appending). The filesystem hierarchy cannot be sparse, unlike the mountpoint hierarchy; to create tank/var/log, you must create tank/var first. However, you can create tank/var-log and explicitly set its mountpoint property to /var/log while leaving /var as a simple directory in /. Perhaps it's surprising that filesystems can be renamed into other positions in the filesystem hierarchy, and as you might expect, doing so changes the properties that they inherited. The command "zfs get" will let you view which properties are inherited and which are local (meaning explicitly set).
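
A few commands illustrating these points (hypothetical datasets, shown only to demonstrate inheritance):

# zfs create tank/var                 # the parent must exist before tank/var/log can
# zfs create tank/var/log             # inherits its parent's mountpoint, so it mounts at /var/log
# zfs rename tank/var/log tank/log    # renaming moves it in the hierarchy; it now inherits from tank and mounts at /log
# zfs get -r -o name,property,value,source mountpoint tank    # SOURCE shows inherited vs. local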

The zpool's special first filesystem has awkward restrictions. There is no command to clear its contents in bulk, and one cannot create a replacement and swap it in. The example creates a descendent filesystem for / and sets the first filesystem's canmount property to off (which is a special property that is not inherited). Thereafter, its existence has no effect except to seed inherited properties of descendents. Although the name of the special first filesystem cannot be changed, the example changes its mountpoint to /. This way the mountpoint of tank/mysql defaults to /mysql as desired. The example shows a clever trick of naming a filesystem "tank/.". This is a perfectly valid name and will result in its inherited mountpoint being "/." which is canonicalized to "/". Because the resulting root mountpoint inherits its mountpoint property rather than setting it explicitly, this trick allows you to use "zfs rename -u" and "reboot" to swap it with another filesystem, if you are brave.
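
A sketch of that swap, assuming the replacement root was prepared as tank/new (hypothetical name) and inherits its mountpoint; after the renames, bootfs=tank/. refers to the replacement by name:

# zfs rename -u tank/. tank/old
# zfs rename -u tank/new tank/.
# reboot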

One may desire to configure additional zpools, perhaps for external storage arrays. On the additional pools, it is helpful to again avoid using the default first filesystem for storage, because of its awkward restrictions. A convenient pattern is to configure additional zpools similarly, by setting the first filesystem's mountpoint explicitly to "/" and setting its canmount property to off. Thereafter, any descendent filesystems will get a reasonable default mountpoint based on their name prefixed with "/".
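
For example, a second pool for an external array might look like this (hypothetical pool name and labels):

# zpool create -O mountpoint=/ -O canmount=off -O atime=off -O compression=lz4 data mirror /dev/gpt/data.zfs0 /dev/gpt/data.zfs1
# zfs create data/archive    # mounts at /archive by inheritance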

Traditional reasons for separating filesystems are mostly not relevant on ZFS. For the most part, separating filesystems in the traditional patterns is make-work and merely impedes the mv command. However, there are new reasons for separate filesystems. You may want to separate a directory to make it easier to back up or replicate: incremental changes can be collected without scanning the whole file hierarchy by using "zfs send -i", which operates only on an entire filesystem. One can guarantee free space for a filesystem ("zfs set refreservation") and limit the size of a filesystem ("zfs set refquota"). Properties such as compression, deduplication, and atime are set on a whole filesystem. Snapshots are created, destroyed, and rolled back in units of whole filesystems. Rolling back the root filesystem without killing all stateful services would cause weird problems, but being able to snapshot and roll back mysql's storage directory can be quite helpful when setting up mysql replication, or when keeping checkpoints from which one can replay transaction logs. Keeping periodic snapshots is a fantastic tool, but some data sets are not appropriate to snapshot at all. For example, a filesystem storing Bacula volumes must avoid having snapshots, or it would run out of space: the volumes are fully rewritten over the course of normal operation, with the expectation that the overwritten space can be reclaimed immediately. If housing jails, putting each jail in its own filesystem is handy for cloning, sending, and rolling back whole jails without affecting the other jails or the host machine.

Other quick notes:

  • Compression usually speeds up IO. lz4 is the best option for most zpools; even its worst case on incompressible data is faster than spinning rust.
  • quota counts descendents in the ZFS hierarchy and snapshots against the limit, but refquota counts only the space the filesystem itself references.
  • reservation includes descendents in the ZFS hierarchy in the space reservation, but refreservation does not.
  • ZFS uses NFSv4 ACLs. If you don't care about ACLs, omit aclmode and aclinherit above; that way chmod will remove ACLs.
  • If you add log devices, it may be worth tuning logbias for logs and databases.
  • utf8only=on seems like a good idea, but I'm unaware of potential consequences. (It specifies to reject filenames that are invalid UTF-8 encodings.)
  • Log devices (also "intent log" or SLOG) store critical data and should be mirrored; cache devices (also "L2ARC") are disposable and cannot be mirrored. A sketch of adding both follows.
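Assuming you've prepared small partitions on fast devices (hypothetical labels):
# zpool add tank log mirror /dev/gpt/tank.slog0 /dev/gpt/tank.slog1
# zpool add tank cache /dev/gpt/tank.l2arc0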

Adding gmirror for swap

  • In this scenario, you probably want to reduce the size of your swap partitions slightly so that if you RMA a disk and get a smaller replacement, you don't have to swapoff, destroy, and rebuild the whole gmirror.
gpart resize -i xxx -s 16G ada0    # xxx = index of the swap partition
  • Run:
kldload geom_mirror
gmirror label -F -v tank.swap /dev/gpt/tank.swap0 /dev/gpt/tank.swap1
  • Add to /etc/fstab:
/dev/mirror/tank.swap.eli none swap sw 0 0
  • Add to /boot/loader.conf:
geom_mirror_load="YES"
  • To activate until reboot:
geli onetime -d /dev/mirror/tank.swap
swapon -a

Note: Before FreeBSD 10.1, gmirror lacked the "destroy" subcommand. As far as I can tell, to remove the last device from a mirror, one had to reboot into single-user mode (or with the geom_mirror module not loaded) and clear the last sector of each partition involved.
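
Replacing a failed component of the swap mirror is simpler; a sketch, assuming the replacement disk's swap partition was labeled tank.swap2:

gmirror forget tank.swap                          # drop components that are no longer present
gmirror insert tank.swap /dev/gpt/tank.swap2      # add the replacement; it synchronizes automatically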

Wizardly tricks

  • Quick aliases:
alias zfsget='zfs get -t filesystem,volume -s local,received,temporary'
alias zfslist='zfs list -o name,mounted,mountpoint,avail,usedbydataset,usedbysnapshots,refcompressratio'
alias zfslistl='zfs list -o name,mounted,canmount,mountpoint,avail,usedbydataset,usedbysnapshots,refcompressratio,refquota,refreservation'
  • Use com.sun:auto-snapshot=true with `pkg install zfstools` to maintain rolling snapshots. Add to /etc/crontab:
0 * * * * root lockf -kst0 /var/run/zfs-auto-snapshot_hourly.lockf zfs-auto-snapshot -ku hourly 168
0 6 * * * root lockf -kst0 /var/run/zfs-auto-snapshot_daily.lockf zfs-auto-snapshot -ku daily 28
0 6 * * Mon root lockf -kst0 /var/run/zfs-auto-snapshot_weekly.lockf zfs-auto-snapshot -ku weekly 53
  • Here's a dandy parlor trick to specify your GPT labels without fiddling with gpart add. For example, consider a 3.6 TiB ("4 TB") disk. (36/10 avoids a floating point result.)
printf "GPT 152\n1 freebsd-boot $((o=40)) $((s=1090)) test.boot0\n2 freebsd-swap $((o+=s)) $((s=1024**3/512*256)) test.swap0\n3 freebsd-zfs $((o+=s)) $((1024**4/512*36/10-o)) test.zfs0\n" | gpart restore -Fl ada0

See Also