Thursday, February 4, 2010

Grow Linux md RAID5 with mdadm --grow

Growing an mdadm RAID array is fairly straightforward these days. There are a few limitations, depending on your setup, and I strongly recommend you read the mdadm man page in addition to the notes here.

A couple of the limitations include:
  • RAID arrays in a container cannot be grown, which excludes DDF arrays
  • arrays with 0.9x metadata are limited to 2TB components - the total size of the array is not affected though (you can check your metadata version as shown below)
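
If you're not sure which metadata version your array uses, mdadm --detail will show it (substitute your own md device for /dev/md3):
# mdadm --detail /dev/md3 | grep Version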


Before you start it's a good idea to run a consistency check on the array. Depending on the size of the array this can take a looong time. On my 3 x 1TB RAID5 array it usually takes around 10 hours with the default settings. You can explore tweaking the settings, though I haven't done this for checks yet; we'll see how to tweak the settings for the reshape later on.

Running a consistency check is done as follows. I don't have sample mdstat output for this step, but I've included the commands for completeness.
# echo check >> /sys/block/md4/md/sync_action
# cat /proc/mdstat


Any errors that were found and corrected in the array parity will show up in the dmesg output and/or the kernel logs.
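
You can also read the mismatch_cnt file once the check finishes; a non-zero value means the check found stripes whose parity didn't match (again using md4 from the check example above):
# cat /sys/block/md4/md/mismatch_cnt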

Once the check is complete you should be safe to grow the array. First you have to add a new device to it so there is a spare drive in the set.
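
If the partition you're adding was ever part of another md array, it may also be worth clearing any stale superblock on it first (adjust the device name to match your new partition, /dev/sdc1 in my case):
# mdadm --zero-superblock /dev/sdc1

Then add it to the array: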
# mdadm --add /dev/md3 /dev/sdc1


The event will appear in the dmesg output and the spare will show up in mdstat:
# dmesg
md: bind<sdc1>

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : active raid5 sdc1[3](S) sdb1[0] sda1[2] hdd1[1]
1953519872 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

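You can also confirm the spare with mdadm --detail, which should report one spare device (again assuming /dev/md3):
# mdadm --detail /dev/md3 | grep -E 'Raid Devices|Spare Devices'
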

Now that the spare is there you can give the command to grow the array. Note that the backup file should not live on the array you are reshaping:
# mdadm --grow /dev/md3 --backup-file=~/mdadm-grow-backup-file.mdadm --raid-devices=4


The array now starts reshaping. You can monitor progress:
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : active raid5 sdc1[3] sdb1[0] sda1[2] hdd1[1]
1953519872 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
[>....................] reshape = 3.3% (33021416/976759936) finish=1421.0min speed=10341K/sec

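If you don't want to keep re-running cat by hand, watch will refresh it for you; mdadm --detail /dev/md3 should also report a "Reshape Status" line while the reshape is running:
# watch -n 30 cat /proc/mdstat
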

In dmesg you should see something like this:
# dmesg

RAID5 conf printout:
--- rd:4 wd:4
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:hdd1
disk 2, o:1, dev:sda1
disk 3, o:1, dev:sdc1
md: reshape of RAID array md3
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
md: using 128k window, over a total of 976759936 blocks.


Before tweaking the speed settings, now is a good time to update your /etc/mdadm.conf with the new ARRAY details so the array is recognized and started on your next reboot.
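
The easiest way to get the updated ARRAY line is to let mdadm generate it and merge it into the config by hand (on Debian-based systems the file is usually /etc/mdadm/mdadm.conf instead):
# mdadm --detail --scan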

Now we can tweak the sync speed settings to shorten the reshape time. I played around with a few values and found the following to be good on my own system.

# echo 8192 >> /sys/block/md3/md/stripe_cache_size
# echo 15000 >> /sys/block/md3/md/sync_speed_min
# echo 200000 >> /sys/block/md3/md/sync_speed_max
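
As I understand it, the sync_speed_min and sync_speed_max files are per-array overrides; the system-wide defaults live in /proc/sys/dev/raid if you'd rather check or raise them there:
# cat /proc/sys/dev/raid/speed_limit_min
# cat /proc/sys/dev/raid/speed_limit_max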


On my system these tweaks cut about a third off the predicted finish time:
# cat /proc/mdstat

Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : active raid5 sdc1[3] sdb1[0] sda1[2] hdd1[1]
1953519872 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
[>....................] reshape = 4.0% (39983488/976759936) finish=957.6min speed=16303K/sec


It seems values greater than 8192 for stripe_cache_size were more harmful than beneficial on my system. It's not clear to me whether the bottleneck is the CPU or bandwidth to the drives, though looking at older posts I suspect both can play a role.

Also note that reducing stripe_cache_size may not take effect immediately when you echo a smaller value to the file. I had to echo the smaller value several times before it was adopted. This was on kernel 2.6.32.7.

You can monitor the stripe_cache_active file to see how full the cache is:
# cat /sys/block/md3/md/stripe_cache_active
7136


When the reshape is complete you will still need to grow the file system (or volume group if you use LVM) contained on the array. I'll document that tomorrow when my reshape is complete ;)
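
In the meantime, as a rough preview: for an ext3 filesystem sitting directly on the md device you would grow it with resize2fs, while with LVM you would first run pvresize on the md device and then extend the logical volume and its filesystem as usual (device name below assumes my /dev/md3):
# resize2fs /dev/md3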