man/man8/zpoolconcepts.8

.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or http://www.opensolaris.org/os/licensing.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd August 9, 2019
.Dt ZPOOLCONCEPTS 8
.Os
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width Ds
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system of which it is a part.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with N disks of size X can hold X bytes and can withstand (N-1) devices
failing without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A variation on RAID-5 that allows for better distribution of parity and
eliminates the RAID-5
.Qq write hole
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group.
.Pp
A raidz group can have single-, double-, or triple-parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with N disks of size X with P parity disks can hold approximately
(N-P)*X bytes and can withstand P device(s) failing without losing data.
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, which
allow for faster resilvering while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with D
data devices and P parity devices.
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default D=8 and 4k disk sectors, the minimum allocation
size is 32k.
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes and dRAID, the default
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant number of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to IOPS, performance is similar to raidz, since for any read all D
data disks must be accessed.
Delivered random IOPS can be reasonably approximated as
floor((N-S)/(D+P))*<single-drive-IOPS>.
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with N disks of size X, D data disks per redundancy group, P parity
level, and S distributed hot spares can hold approximately (N-S)*(D/(D+P))*X
bytes and can withstand P device(s) failing without losing data.
.It Sy draid[<parity>][:<data>d][:<children>c][:<spares>s]
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword.
.Pp
.Em parity
- The parity level (1-3).
.Pp
.Em data
- The number of data devices per redundancy group.
In general, a smaller value of D will increase IOPS, improve the compression
ratio, and speed up resilvering at the expense of total usable capacity.
Defaults to 8, unless N-P-S is less than 8.
.Pp
.Em children
- The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.Pp
.Em spares
- The number of distributed hot spares.
Defaults to zero.
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device dedicated solely for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line, separated by
whitespace.
The keywords
.Sy mirror
and
.Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates two root vdevs, each a mirror of two disks:
.Bd -literal
# zpool create mypool mirror sda sdb mirror sdc sdd
.Ed
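.Pp
As a further illustrative sketch, the following would create a single
double-parity raidz group of six disks.
The pool name and the device names
.Pa sda
through
.Pa sdf
are placeholders, and the 4 TB disk size is an assumption chosen only to work
through the raidz capacity formula given earlier:
.Bd -literal
# zpool create mypool raidz2 sda sdb sdc sdd sde sdf

N = 6 disks, P = 2 parity, X = 4 TB
usable capacity ~ (N-P)*X = (6-2)*4 TB = 16 TB
.Ed
.Pp
Similarly, a dRAID vdev with double parity, four data devices per redundancy
group, eleven children, and one distributed hot spare could be requested as
sketched below.
Again, the device names and the 4 TB disk size are assumptions, and the
figures are only the approximations produced by the dRAID formulas given
earlier:
.Bd -literal
# zpool create pool draid2:4d:11c:1s sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk

N = 11, D = 4, P = 2, S = 1, X = 4 TB
usable capacity ~ (N-S)*(D/(D+P))*X = 10*(4/6)*4 TB ~ 26.7 TB
random IOPS     ~ floor((N-S)/(D+P)) = floor(10/6) = 1 x single-drive IOPS
.Ed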
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data are checksummed, and ZFS automatically repairs bad data
from a good copy when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states: online, degraded,
or faulted.
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of a top-level vdev, such as a mirror or raidz device, is
potentially impacted by the state of its associated vdevs, or component
devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs are in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices are in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs are in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices are in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported while a device is unavailable, the device will be
identified by a unique identifier instead of its path, since the path was never
correct in the first place.
.El
.Pp
If a device is removed and later re-attached to the system, ZFS attempts
to put the device online automatically.
Device attach detection is hardware-dependent and might not be supported on all
platforms.
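.Pp
As a rough illustration of how these states are typically observed and
changed, the commands below list pools that are not healthy, take a device
offline, inspect the result, and bring the device back online.
The pool name
.Ar pool
and the device name
.Pa sda
are placeholders:
.Bd -literal
# zpool status -x
# zpool offline pool sda
# zpool status pool
# zpool online pool sda
.Ed
.Pp
While the device is offline, it is expected to be reported as
.Sy OFFLINE
and, in a redundant configuration, its parent vdev and the pool as
.Sy DEGRADED .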
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example:
.Bd -literal
# zpool create pool mirror sda sdb spare sdc sdd
.Ed
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which could lead to
data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts, and
both pools suffer a device failure at the same time, both could attempt to use
the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
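.Pp
A minimal sketch of these operations follows, assuming a pool named
.Ar pool
in which the hot spare
.Pa sdc
has taken over for the failed device
.Pa sda ,
and in which
.Pa sde
is an otherwise unused disk.
All names are placeholders, and the two detach commands are alternatives
rather than a sequence:
.Bd -literal
Add another hot spare to the pool:
# zpool add pool spare sde

Cancel the in-progress replacement by detaching the hot spare:
# zpool detach pool sdc

Or let the hot spare take over permanently by detaching the failed device:
# zpool detach pool sda

Remove an unused hot spare from the pool:
# zpool remove pool sde
.Ed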
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po for example,
.Qq draid1-2-3
specifies spare 3 of vdev 2, which is a single-parity dRAID
.Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Bd -literal
# zpool create pool sda sdb log sdc
.Ed
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sy EXAMPLES
section of
.Xr zpool 8
for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
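.Pp
For instance, a mirrored log could be added to an existing pool and later
removed again, roughly as follows.
The pool and device names are placeholders, and
.Sy mirror-1
is an assumed vdev name; the actual name of the top-level log mirror is the
one reported by
.Nm zpool Cm status :
.Bd -literal
# zpool add pool log mirror sdc sdd
# zpool status pool
# zpool remove pool mirror-1
.Ed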
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for
random-read workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Bd -literal
# zpool create pool sda sdb cache sdc sdd
.Ed
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and is restored
asynchronously into the L2ARC when the pool is imported (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled = 0 .
For cache devices smaller than 1GB, ZFS does not write the metadata structures
required for rebuilding the L2ARC, in order not to waste space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header (512 bytes) is updated even if no metadata structures
are written.
Setting
.Sy l2arc_headroom = 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful in case of memory pressure
where the contents of the cache device are not fully restored in L2ARC.
The user can offline and online the cache device when there is less memory
pressure in order to fully restore its contents to L2ARC.
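.Pp
As a sketch of the two procedures described above, the commands below first
show how the L2ARC rebuild might be disabled, and then, independently, how a
cache device could be cycled so that its contents are restored.
The
.Pa /sys/module/zfs/parameters
path is an assumption that applies to Linux; other platforms expose module
parameters differently.
The pool name
.Ar pool
and the cache device name
.Pa sdc
are placeholders:
.Bd -literal
Disable restoring cache device contents at import (Linux path assumed):
# echo 0 > /sys/module/zfs/parameters/l2arc_rebuild_enabled

With the rebuild enabled, cycle a cache device once memory pressure eases:
# zpool offline pool sdc
# zpool online pool sdc
.Ed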
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq such as Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint:
specifically, vdev removal, attach, and detach; mirror splitting; and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Bd -literal
# zpool checkpoint pool
.Ed
.Pp
To later rewind a pool to its checkpointed state, first export it and
then rewind it during import:
.Bd -literal
# zpool export pool
# zpool import --rewind-to-checkpoint pool
.Ed
.Pp
To discard the checkpoint from a pool:
.Bd -literal
# zpool checkpoint -d pool
.Ed
.Pp
Dataset reservations (controlled by the
.Sy reservation
or
.Sy refreservation
ZFS properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.Ss Special Allocation Class
The allocations in the special class are dedicated to specific block types.
By default this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal (non-dedup/special) vdev before
other devices can be assigned to the special class.
If the special class becomes full, then allocations intended for it will
spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by setting the
.Sy zfs_ddt_data_is_special
ZFS module parameter to false (0).
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed in the special
class by setting the
.Sy special_small_blocks
dataset property.
It defaults to zero, so you must opt in by setting it to a non-zero value.
See
.Xr zfs 8
for more information on setting this property.
 485 .Xr zfs 8
 486 for more info on setting this property.