Documentation/md-cluster.txt

The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

Separate write-intent bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super[0] + bits  |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits[2, contd]   | bm super[3] + bits  |
| bm bits[3, contd]   |                     |                     |
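
As a rough illustration of the layout above, the following Python sketch
computes where each node's bitmap superblock would start. It assumes that
every bitmap (superblock plus bits) occupies exactly two 4k blocks, as drawn
in the diagram; real bitmaps may be larger. The names bitmap_offset,
BLOCKS_PER_BITMAP and FIRST_BITMAP_BLOCK are hypothetical and used only for
illustration.

```python
BLOCK = 4096  # the 4k unit used in the layout diagram

# Assumption: each node's bitmap (super + bits) spans two 4k blocks,
# exactly as drawn in the diagram; the real size depends on the array.
BLOCKS_PER_BITMAP = 2
FIRST_BITMAP_BLOCK = 2  # block 0 is idle, block 1 holds the md superblock


def bitmap_offset(slot):
    """Byte offset of the bitmap superblock for a zero-based bitmap slot."""
    if slot < 0:
        raise ValueError("bitmap slots start from zero")
    return (FIRST_BITMAP_BLOCK + slot * BLOCKS_PER_BITMAP) * BLOCK


for slot in range(4):
    print("bm super[%d] starts at %dk" % (slot, bitmap_offset(slot) // 1024))
```

Under these assumptions the four superblocks in the diagram land at 8k, 16k,
24k and 32k.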

During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will
 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.
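
The three steps of a write request can be sketched as a small in-memory
model. This is only an illustration, not the md implementation: the class
WriteIntentBitmap, the chunk granularity, the dict-based mirrors and the
explicit "now" argument are all assumptions, and writing the bitmap itself
to disk is left out.

```python
class WriteIntentBitmap:
    """Toy per-node bitmap: one bit per chunk, cleared after a timeout."""

    def __init__(self, clear_delay=5.0):
        self.clear_delay = clear_delay
        self.bits = set()      # chunks with a write that may be in flight
        self.clear_at = {}     # chunk -> time after which the bit may clear

    def write(self, chunk, data, mirrors, now):
        self.bits.add(chunk)                  # set the bit (no-op if set)
        for mirror in mirrors:                # commit the write to all mirrors
            mirror[chunk] = data
        self.clear_at[chunk] = now + self.clear_delay  # schedule the clear

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        for chunk in [c for c, t in self.clear_at.items() if t <= now]:
            self.bits.discard(chunk)
            del self.clear_at[chunk]


mirrors = [{}, {}]
bitmap = WriteIntentBitmap(clear_delay=5.0)
bitmap.write(7, b"payload", mirrors, now=100.0)
bitmap.expire(now=102.0)           # too early: bit 7 stays set
still_set = 7 in bitmap.bits
bitmap.expire(now=105.0)           # timeout reached: bit 7 is cleared
```

A second write to the same chunk simply re-arms the timeout, so a busy chunk
keeps its bit set.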

Reads are just handled normally.  It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

 The bm_lockres protects individual node bitmaps. They are named in the
 form bitmap001 for node 1, bitmap002 for node 2 and so on. When a node
 joins the cluster, it acquires the lock in PW mode and holds it in that
 mode for as long as the node is part of the cluster. The lock resource
 number is based on the slot number returned by the DLM subsystem. Since
 DLM starts node count from one and bitmap slots start from zero, one is
 subtracted from the DLM slot number to arrive at the bitmap slot number.
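
 The slot arithmetic and the naming pattern can be written down as two
 small helpers. Both function names are hypothetical, and the three-digit
 zero padding is only inferred from the "bitmap001" example above.

```python
def bitmap_slot(dlm_slot):
    """Map a one-based DLM slot number to a zero-based bitmap slot."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    return dlm_slot - 1


def bitmap_lock_name(node):
    """Lock resource name in the bitmapNNN form used in the text.

    Assumption: three zero-padded digits, inferred from "bitmap001".
    """
    return "bitmap%03d" % node
```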

3. Communication

Each node has to communicate with other nodes when starting or ending
a resync, and when the metadata superblock is updated.

3.1 Message Types

 There are 3 types of messages which are passed:

 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
   updated, and the node must re-read the md superblock. This is performed
   synchronously.

 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
   so that each node may suspend or resume the region.

 3.1.3 NEWDISK: informs other nodes that a device is being added to the
   array. The message carries the uuid and slot number of the device
   (see section 5).

3.2 Communication mechanism

 The DLM LVB (lock value block) is used to communicate between nodes of
 the cluster. There are three resources used for the purpose:

  3.2.1 Token: The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

  3.2.2 Message: The lock resource which carries the data to
   communicate.

  3.2.3 Ack: The resource whose acquisition in EX mode means the message
   has been acknowledged by all nodes in the cluster. The BAST (blocking
   AST) of the resource is used to inform the receiving nodes that a node
   wants to communicate.

The algorithm is:

 1. receive status (idle state: every node holds CR on ACK)

   sender                         receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR

 2. sender gets EX on TOKEN
    sender gets EX on MESSAGE

   sender                         receiver                   receiver
   TOKEN:EX                       ACK:CR                     ACK:CR
   MESSAGE:EX
   ACK:CR

    Sender checks that it still needs to send a message. Messages received
    or other events that happened while waiting for the TOKEN may have made
    this message inappropriate or redundant.

 3. sender writes LVB
    sender down-converts MESSAGE from EX to CR
    sender tries to get EX on ACK
    [ wait until all receivers have *processed* the MESSAGE ]

                                     [ triggered by BAST of ACK ]
                                     receiver gets CR on MESSAGE
                                     receiver reads LVB
                                     receiver processes the message
                                     [ wait finish ]
                                     receiver releases ACK

   sender                         receiver                   receiver
   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CR
   ACK:EX

 4. triggered by grant of EX on ACK (indicating all receivers have processed
    the message)
    sender down-converts ACK from EX to CR
    sender releases MESSAGE
    sender releases TOKEN
                               receiver up-converts to EX on MESSAGE
                               receiver gets CR on ACK
                               receiver releases MESSAGE

   sender                         receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR
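
The lock transitions above can be traced with a small single-threaded model.
This is a sketch under stated assumptions, not the DLM API: the Resource
class and the send() function are hypothetical, only the CR and EX modes are
modelled, a request that would block is treated as a bug, and the receivers'
up-convert of MESSAGE in step 4 is left out because a single-threaded toy
cannot model blocking conversions. The sender's "still needs to send" check
is also omitted.

```python
CR, EX = "CR", "EX"  # CR is compatible with CR; EX is compatible with nothing


class Resource:
    """A DLM-like lock resource with an optional value block (LVB)."""

    def __init__(self, name):
        self.name = name
        self.holders = {}   # node -> mode currently held
        self.lvb = None

    def can_grant(self, node, mode):
        others = [m for n, m in self.holders.items() if n != node]
        if mode == EX:
            return not others
        return EX not in others

    def lock(self, node, mode):
        # Also used for conversions: the node's old mode is replaced.
        assert self.can_grant(node, mode), "%s would block on %s" % (node, self.name)
        self.holders[node] = mode

    def unlock(self, node):
        del self.holders[node]


def send(sender, receivers, payload, token, message, ack, log):
    # Step 2: sender takes EX on TOKEN and on MESSAGE.
    token.lock(sender, EX)
    message.lock(sender, EX)
    # Step 3: write the LVB, down-convert MESSAGE, then request EX on ACK.
    message.lvb = payload
    message.lock(sender, CR)
    # The EX request raises a BAST on every receiver, which takes CR on
    # MESSAGE, reads and processes the LVB, and releases ACK.
    for r in receivers:
        message.lock(r, CR)
        log.append((r, message.lvb))
        ack.unlock(r)
    ack.lock(sender, EX)      # granted once every receiver released ACK
    # Step 4: everyone returns to the idle state.
    ack.lock(sender, CR)
    message.unlock(sender)
    token.unlock(sender)
    for r in receivers:
        ack.lock(r, CR)
        message.unlock(r)


nodes = ["node1", "node2", "node3"]
token, message, ack = Resource("TOKEN"), Resource("MESSAGE"), Resource("ACK")
for n in nodes:
    ack.lock(n, CR)           # step 1: every node idles with CR on ACK
log = []
send("node1", ["node2", "node3"], "METADATA_UPDATED", token, message, ack, log)
```

After send() returns, every node holds ACK in CR mode again and nobody holds
TOKEN or MESSAGE, which matches the final state shown in step 4.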


4. Handling Failures

4.1 Node Failure
 When a node fails, the DLM informs the cluster with the slot number of
 the failed node. The node receiving this notification starts a cluster
 recovery thread. The cluster recovery thread:
        - acquires the bitmap<number> lock of the failed node
        - opens the bitmap
        - reads the bitmap of the failed node
        - copies the set bits to the local node's bitmap
        - clears the bitmap of the failed node
        - releases the bitmap<number> lock of the failed node
        - initiates resync of the bitmap on the current node
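
 The recovery steps can be sketched with bitmaps modelled as sets of dirty
 chunk numbers. The function recover_failed_node, the dict of bitmaps and
 the set of held locks are hypothetical stand-ins for illustration; the
 real thread works on the on-disk bitmaps and DLM locks.

```python
def recover_failed_node(bitmaps, held_locks, failed, local):
    """Fold the failed node's dirty bits into the local node's bitmap.

    bitmaps:    dict mapping a slot to a set of dirty chunk numbers
    held_locks: set of lock keys currently held by the local node
    Returns the sorted chunks the local node now has to resync.
    """
    lock = ("bitmap", failed)
    held_locks.add(lock)                  # acquire the failed node's lock
    try:
        dirty = set(bitmaps[failed])      # open and read the failed bitmap
        bitmaps[local] |= dirty           # copy the set bits to the local node
        bitmaps[failed].clear()           # clear the failed node's bitmap
    finally:
        held_locks.discard(lock)          # release the lock
    return sorted(bitmaps[local])         # input for the resync that follows


bitmaps = {0: {1, 2}, 1: {2, 9}}
held = set()
to_resync = recover_failed_node(bitmaps, held, failed=1, local=0)
```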

 The resync process is the regular md resync. However, in a clustered
 environment, the node performing a resync needs to tell the other nodes
 which areas are suspended. Before a resync starts, the node
 sends out RESYNC_START with the (lo,hi) range of the area which needs
 to be suspended. Each node maintains a suspend_list, which contains
 the list of ranges which are currently suspended. On receiving
 RESYNC_START, the node adds the range to the suspend_list. Similarly,
 when the node performing the resync finishes, it sends RESYNC_FINISHED
 to the other nodes, and they remove the corresponding entry from
 the suspend_list.

 A helper function, should_suspend(), can be used to check if a particular
 I/O range should be suspended or not.
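
 A minimal model of the suspend_list bookkeeping is shown below. The class
 and method names are hypothetical, and the ranges are assumed to be
 inclusive (lo, hi) pairs; the should_suspend() here only mirrors the idea
 of the helper named above, not its actual signature.

```python
class SuspendList:
    """Ranges that are currently being resynced by some other node."""

    def __init__(self):
        self.ranges = []   # list of inclusive (lo, hi) tuples

    def on_resync_start(self, lo, hi):
        self.ranges.append((lo, hi))

    def on_resync_finished(self, lo, hi):
        # Raises ValueError if no matching RESYNC_START was recorded.
        self.ranges.remove((lo, hi))

    def should_suspend(self, lo, hi):
        # Two inclusive ranges overlap when each starts before the other ends.
        return any(lo <= s_hi and s_lo <= hi for s_lo, s_hi in self.ranges)


suspended = SuspendList()
suspended.on_resync_start(1000, 1999)
overlapping = suspended.should_suspend(1500, 2500)   # touches the range
disjoint = suspended.should_suspend(2000, 2500)      # starts past the range
suspended.on_resync_finished(1000, 1999)
after_finish = suspended.should_suspend(1500, 2500)
```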

4.2 Device Failure
 Device failures are handled and communicated with the metadata update
 routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

    1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
    2. Node 1 sends NEWDISK with uuid and slot number
    3. Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
    4. In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
    5. Other nodes issue either of the following depending on whether the disk
       was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
                disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
    6. Other nodes drop their CR lock on no-new-devs if the device is found
    7. Node 1 attempts EX lock on no-new-devs
    8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
       as SpareLocal
    9. If node 1 does not get the no-new-devs lock, it fails the operation and
       sends METADATA_UPDATED
    10. Other nodes learn whether the disk was added or not from the
        subsequent METADATA_UPDATED.
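
The outcome of steps 6 to 9 depends only on which nodes still hold the CR
lock when node 1 asks for EX. The sketch below models just that decision.
The function add_disk_outcome and its return values are hypothetical, and it
assumes that the EX attempt fails as soon as any CR holder remains.

```python
def add_disk_outcome(found_by_node):
    """Decide whether node 1 may add the disk.

    found_by_node maps each *other* node to True if it found the new disk.
    Every other node holds CR on no-new-devs and drops it only when the
    disk was found (step 6), so node 1's EX attempt (step 7) succeeds only
    if no CR holder remains.
    """
    cr_holders = {node for node, found in found_by_node.items() if not found}
    got_ex = not cr_holders
    # In both cases node 1 sends METADATA_UPDATED afterwards (steps 8 and 9).
    return "added" if got_ex else "failed"


all_found = add_disk_outcome({"node2": True, "node3": True})
one_missing = add_disk_outcome({"node2": True, "node3": False})
```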