2018 DIY ZFS NAS (Disaster, mitigated)

My last NAS build was in 2016 and worked out pretty well – everything went together and worked without a problem (Ubuntu LTS + ZFS), and as far as I know it’s still humming along (handed over for Lensley usage).

Since I settled into Seattle, I’ve been limping along with the old Synology DS1812+ hand-me-downs. They’re still chugging along (16.3TB of RAID6-ish (SHR-2) storage), but definitely getting long in the tooth. They’ve always performed terribly at rsync due to a very weak CPU, and I’ve been unhappy with the lack of snapshotting and the inconvenient shell access, among other things, so at the beginning of the year I decided to finally get moving on building a new NAS.

I wanted to build a custom ZFS system to get nice things like snapshots and compression, and 8 x 8TB w/ RAID-Z2 seemed like a good performance/safety balance (in theory about 43TiB of usable space). When I started at the beginning of the year, 8TB drives were the best price/perf; with 12TB and 14TB drives out now, 10TB drives seem to have taken that slot as of this writeup. I paid a bit extra for HGST drives as they have historically been the most reliable according to Backblaze’s data, although it seems like the newest Seagate drives are performing pretty well.
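
For reference, here’s the back-of-the-envelope math behind that figure (decimal TB vs. binary TiB, before ZFS metadata/slop overhead) – the bc calls are just there to show the arithmetic:

# echo 'scale=1; 8*8*10^12 / 2^40' | bc -l        # raw: 64TB ≈ 58.2TiB
58.2
# echo 'scale=1; (8-2)*8*10^12 / 2^40' | bc -l    # minus 2 parity drives ≈ 43.6TiB usable
43.6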

Because I wanted ECC and decent performance (mainly to run additional server tasks) without paying an arm and a leg, and I had a good experience with first-gen Ryzen on Linux (my main workstation is a 1700 on Arch), I decided that I’d wait for one of the new Ryzen 2400G APUs which would have Ryzen’s under-the-radar ECC support with an IGP (so the mini-ITX PCIe slot could be used for a dedicated storage controller). This would prove to be a big mistake, but more on that in a bit.

2018 NAS

In the end, I did get it all working, but instead of an afternoon of setup, this project dragged on for months and is not quite the traditional NAS I was aiming for. This also just became one of those things where you’d start working on it, run into problems, do some research and maybe order some replacement parts to test out, and then put it off because it was such a pain. I figured I’d outline some of the things that went wrong for posterity, as this build was definitely a disaster – the most troublesome build I’ve had in years, if not decades. The problems mostly fell into two categories: SAS issues, and Ryzen APU (Raven Ridge) issues.

First, the easier part: the SAS issues. This is something I simply hadn’t encountered before, but it makes sense looking back. I thought I was pretty familiar w/ how SAS and SATA inter-operate, and since I was planning on using a SAS card I had laying around and an enclosure w/ a SAS backplane, I figured, what the heck, why not get SAS drives. Well, let me tell you why not: for the HGST Ultrastar He8’s (PDF data sheet) that I bought, the SATA models are SATA III (6Gb/s) and are backwards compatible with just about everything, but the SAS models are SAS3 (12Gb/s), which it turns out are not backwards compatible at all and require a full SAS3 chain. That meant the 6Gb/s controller (LSI 9207-8i), the SFF-8087 cables, and the SAS backplane of the otherwise sweet NAS chassis all had to be replaced.
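
If you want to sanity-check what link rate each drive actually negotiated (handy when diagnosing a mixed SAS2/SAS3 chain), the Adaptec CLI I ended up with (see below) can report it per physical device – note that the exact field label may vary between arcconf versions:

# arcconf GETCONFIG 1 PD | grep -i 'Transfer Speed'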

  • I spent time fiddling with various firmware updates to the 9207 while I was trying to diagnose my drive errors. This might be relevant for those who decide to go with the 9207 as a SAS controller for regular SATA drives, as P20 sometimes has problems vs P19 and you may need to downgrade. You’ll also want the IT firmware (these are terrible RAID cards and you want to just pass the devices through). Note that despite the numbering, the 9207 is newer than the 9211 (PCIe 3.0 vs 2.0)
  • Some people have issues with forward vs. reverse breakout cables, but I didn’t, and the spreadsheet at the end of this writeup links to the right cables to buy
  • SAS3 is more expensive across the board – you pay a slight premium (not much) for the drives, and then about double for the controller (I paid $275 for the Microsemi Adaptec 8805E 12Gb/s controller vs $120 for my LSI 9207-8i 6Gb/s controller) and cables (if you’ve made it this far though, $25 is unlikely to break the bank). The biggest pain was finding a 12Gb/s backplane – they didn’t have those for my NAS case, and other cases available were pretty much all ginormous rackmounts. The cheapest option for me ended up being simply buying 2 hot-swap 12Gb/s enclosures (you must get the T3 model w/ the right backplane) and just letting them live free-range
  • BTW, just as a note: if you have a choice between 512-byte and 4K (Advanced Format) sector drives, choose the latter – performance will be much better. If you are using ZFS, be sure to create your pool with ashift=12 to match the 4K sectors (see the sketch just below this list)
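
Since I mention ashift above, here’s a minimal sketch of what pool creation looks like with ashift=12 forced – the by-id device names are placeholders, substitute your own:

# zpool create -o ashift=12 z raidz2 \
    /dev/disk/by-id/scsi-DRIVE1 /dev/disk/by-id/scsi-DRIVE2 \
    /dev/disk/by-id/scsi-DRIVE3 /dev/disk/by-id/scsi-DRIVE4 \
    /dev/disk/by-id/scsi-DRIVE5 /dev/disk/by-id/scsi-DRIVE6 \
    /dev/disk/by-id/scsi-DRIVE7 /dev/disk/by-id/scsi-DRIVE8
# zdb -C z | grep ashift      # should report ashift: 12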

All this work is for bandwidth that, honestly, 8 spinning-rust disks are unlikely to use, so if I were doing it over again, I’d probably go with SATA and save myself a lot of time and money.
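
To put rough numbers on that (my estimates – ~200MB/s sustained per He8, and a 4-lane 6Gb/s link after 8b/10b encoding overhead):

# echo 'scale=1; 8*200/1000' | bc -l      # ~8 drives of sequential throughput, in GB/s
1.6
# echo 'scale=1; 4*6*0.8/8' | bc -l       # usable GB/s on a single SAS2 x4 wide port
2.4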

Speaking of wastes of time and money, the biggest cause of my NAS-building woes by far was the Ryzen 2400G APU (Raven Ridge). Quite simply, even as of July 2018, I can’t recommend the Ryzen APUs if you’re running Linux. Going another route means paying more and having slim pickings on the motherboard front if you want mini-ITX and ECC, but you’ll probably spend a lot less time pulling your hair out.

  • I bought the ASRock Mini-ITX board as their reps had confirmed ECC and Raven Ridge support. Of course, the boards in the channel didn’t (and still don’t, depending on manufacture date) support the Ryzen APUs out of the box, and you can’t boot to update the BIOS without a compatible CPU. AMD has a “boot CPU” program, but it was an involved process, and after a few emails I just ordered a CPU from Amazon to use and sent it back when I finished (I’ve made 133 Amazon orders in the past 6 months so I don’t feel too bad about that). I had intermittent booting issues (it’d boot to a blank screen about half the time) w/ Ubuntu 18.04 until I updated to the latest 4.60 BIOS.
  • With my LSI 9207 card plugged in, 18.04 LTS (4.15 kernel) seemed happy enough (purely on TTYs – I haven’t run any Xorg on this, which has its own, even worse set of issues); however, with the Adaptec 8805E it wouldn’t boot at all. Not even the Ubuntu install media would boot, but the latest Arch installer would (I’d credit the 4.17 kernel). There’s probably some way to slipstream an updated kernel into the LTS installer (my preference is generally to run LTS on servers), but in the end I couldn’t be bothered and just went with Arch (and archzfs – there’s a rough setup sketch after this list) on this machine. YOLO!
  • After I got everything seemingly installed and working, I was getting some lockups overnight. These hangs left no messages in dmesg or the journalctl logs. Doing a search on Ryzen 2400G, Raven Ridge, and Ryzen motherboard lockups/hangs/crashes will probably quickly make you realize why I won’t recommend Ryzen APUs to anyone. In the end I went into the BIOS and basically disabled anything that might be causing a problem, and it seems to be pretty stable (at the cost of constantly high power usage):
    • Disable Cool’n’Quiet
    • Disable Global C-States
    • Disable Sound
    • Disable WAN Radio
    • Disable SATA (my boot drive is NVMe)
    • Disable Suspend to RAM, other ACPI options
  • It’s also worth noting that while most Ryzen motherboards will support ECC for Summit Ridge (Ryzen 1X00) and Pinnacle Ridge (non-APU Ryzen 2X00), they don’t support ECC on Raven Ridge (unbuffered ECC memory will run, but in non-ECC mode), despite Raven Ridge having ECC support in its memory controller. There’s a lot of confusion on this topic if you do Google searches, so it was hard to suss out, but from what I’ve seen there have been no confirmed reports of ECC on Raven Ridge working on any motherboard. Here’s how I checked whether ECC was actually enabled or not:
    # journalctl -b | grep EDAC
    Jul 28 20:11:31 z kernel: EDAC MC: Ver: 3.0.0
    Jul 28 20:11:32 z kernel: EDAC amd64: Node 0: DRAM ECC disabled.
    Jul 28 20:11:32 z kernel: EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    
    # modprobe -v amd64_edac_mod ecc_enable_override=1
    insmod /lib/modules/4.17.10-1-ARCH/kernel/drivers/edac/amd64_edac_mod.ko.xz ecc_enable_override=1
    modprobe: ERROR: could not insert 'amd64_edac_mod': No such device
    # edac-ctl --status
    edac-ctl: drivers not loaded.
    # edac-ctl --mainboard
    edac-ctl: mainboard: ASRock AB350 Gaming-ITX/ac
    
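
As promised above, here’s roughly what the Arch + archzfs setup looks like. Treat it as a sketch: check the archzfs docs for the current repo signing key (import and locally sign it with pacman-key first), and decide between zfs-dkms (rebuilds against kernel updates) and zfs-linux (pinned to a specific kernel) – I’m showing the dkms variant here:

# cat >> /etc/pacman.conf << 'EOF'
[archzfs]
Server = https://archzfs.com/$repo/$arch
EOF
# pacman -Syu
# pacman -S zfs-dkms zfs-utils
# systemctl enable zfs-import-cache zfs-mount zfs.target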

OK, so what’s the end result look like? The raw zpool (64TB == 58.2TiB):

# zpool list                                                                                                          
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
z       58T  5.39T  52.6T         -     0%     9%  1.00x  ONLINE  -    

And here’s what the actual storage looks like (LZ4 compression running):

# df -h | grep z                                                                                                      
Filesystem      Size  Used Avail Use% Mounted on                                                                              
z                40T  4.0T   37T  10% /z

# zfs get compressratio z                                                                                             
NAME  PROPERTY       VALUE  SOURCE                                                                                            
z     compressratio  1.01x  -
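
For completeness, the compression (and the snapshotting that motivated this build in the first place) are just one-liners – these are illustrative rather than a transcript of exactly what I ran, and note that compression only applies to data written after it’s enabled:

# zfs set compression=lz4 z
# zfs snapshot z@2018-08-01
# zfs list -t snapshot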

Temperatures for the card (I have a 120mm fan pointed at it) and the drives seem pretty stable (+/- 1C or so):

# arcconf GETCONFIG 1 | grep Temperature                                                                              
   Temperature                              : 42 C/ 107 F (Normal)
         Temperature                        : 41 C/ 105 F
         Temperature                        : 44 C/ 111 F
         Temperature                        : 44 C/ 111 F
         Temperature                        : 40 C/ 104 F
         Temperature                        : 40 C/ 104 F
         Temperature                        : 43 C/ 109 F
         Temperature                        : 43 C/ 109 F
         Temperature                        : 40 C/ 104 F

Performance seems decent, about 300MB/s copying from a USB-C/SATA SSD. Here are the results of an iozone (3.482) run (settings taken from this benchmark):

# iozone -i 0 -i 1 -t 1 -s16g -r16k  -t 20 /z                                   
        File size set to 16777216 kB
        Record Size 16 kB
        Command line used: iozone -i 0 -i 1 -t 1 -s16g -r16k -t 20 /z                  
        Output is in kBytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 kBytes.                                       
        Processor cache line size set to 32 bytes.                                     
        File stride size set to 17 * record size.                                      
        Throughput test with 20 processes
        Each process writes a 16777216 kByte file in 16 kByte records                  

        Children see throughput for  1 initial writers  = 1468291.75 kB/sec            
        Parent sees throughput for  1 initial writers   = 1428999.28 kB/sec            
        Min throughput per process                      = 1468291.75 kB/sec            
        Max throughput per process                      = 1468291.75 kB/sec            
        Avg throughput per process                      = 1468291.75 kB/sec            
        Min xfer                                        = 16777216.00 kB               

        Children see throughput for  1 rewriters        = 1571411.62 kB/sec            
        Parent sees throughput for  1 rewriters         = 1426592.78 kB/sec            
        Min throughput per process                      = 1571411.62 kB/sec            
        Max throughput per process                      = 1571411.62 kB/sec            
        Avg throughput per process                      = 1571411.62 kB/sec            
        Min xfer                                        = 16777216.00 kB               

        Children see throughput for  1 readers          = 3732752.00 kB/sec            
        Parent sees throughput for  1 readers           = 3732368.39 kB/sec            
        Min throughput per process                      = 3732752.00 kB/sec            
        Max throughput per process                      = 3732752.00 kB/sec            
        Avg throughput per process                      = 3732752.00 kB/sec            
        Min xfer                                        = 16777216.00 kB               

        Children see throughput for 1 re-readers        = 3738624.75 kB/sec            
        Parent sees throughput for 1 re-readers         = 3738249.69 kB/sec            
        Min throughput per process                      = 3738624.75 kB/sec            
        Max throughput per process                      = 3738624.75 kB/sec            
        Avg throughput per process                      = 3738624.75 kB/sec            
        Min xfer                                        = 16777216.00 kB               


        Each process writes a 16777216 kByte file in 16 kByte records                  

        Children see throughput for 20 initial writers  = 1402434.54 kB/sec            
        Parent sees throughput for 20 initial writers   = 1269383.28 kB/sec            
        Min throughput per process                      =   66824.69 kB/sec            
        Max throughput per process                      =   73967.23 kB/sec            
        Avg throughput per process                      =   70121.73 kB/sec            
        Min xfer                                        = 15157264.00 kB               

        Children see throughput for 20 rewriters        =  337542.41 kB/sec            
        Parent sees throughput for 20 rewriters         =  336665.90 kB/sec            
        Min throughput per process                      =   16713.62 kB/sec            
        Max throughput per process                      =   17004.56 kB/sec            
        Avg throughput per process                      =   16877.12 kB/sec            
        Min xfer                                        = 16490176.00 kB     

        Children see throughput for 20 readers          = 3451576.27 kB/sec            
        Parent sees throughput for 20 readers           = 3451388.13 kB/sec            
        Min throughput per process                      =  171099.14 kB/sec            
        Max throughput per process                      =  173923.14 kB/sec            
        Avg throughput per process                      =  172578.81 kB/sec            
        Min xfer                                        = 16505216.00 kB               

        Children see throughput for 20 re-readers       = 3494448.80 kB/sec            
        Parent sees throughput for 20 re-readers        = 3494333.50 kB/sec            
        Min throughput per process                      =  173403.55 kB/sec            
        Max throughput per process                      =  176221.58 kB/sec            
        Avg throughput per process                      =  174722.44 kB/sec            
        Min xfer                                        = 16508928.00 kB               

While running this, it looked like each of the 4 cores hit about 10-15% in htop, with load around 21-22 (processes waiting on iozone, of course). Here are the arcstats near the peak of the run:

# arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c              
18:15:40  266K   24K      9     0    0   24K  100    24    0   7.2G  7.2G              
18:15:41  293K   26K      9     2    0   26K  100    28    0   7.1G  7.1G              
18:15:42  305K   27K      9     0    0   27K  100    26    0   7.1G  7.1G              

Anyway, those are some surprisingly big (totally synthetic) numbers, but I don’t have much of a reference point, so for comparison I ran the same test on the cheap ADATA M.2 NVMe SSD that I use as my boot drive:

# iozone -i 0 -i 1 -t 1 -s16g -r16k -t 20 ./
	File size set to 16777216 kB
	Record Size 16 kB
	Command line used: iozone -i 0 -i 1 -t 1 -s16g -r16k -t 20 ./
	Output is in kBytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 kBytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 20 processes
	Each process writes a 16777216 kByte file in 16 kByte records

	Children see throughput for  1 initial writers 	=  737763.19 kB/sec
	Parent sees throughput for  1 initial writers 	=  628783.43 kB/sec
	Min throughput per process 			=  737763.19 kB/sec 
	Max throughput per process 			=  737763.19 kB/sec
	Avg throughput per process 			=  737763.19 kB/sec
	Min xfer 					= 16777216.00 kB

	Children see throughput for  1 rewriters 	=  537308.31 kB/sec
	Parent sees throughput for  1 rewriters 	=  453965.01 kB/sec
	Min throughput per process 			=  537308.31 kB/sec 
	Max throughput per process 			=  537308.31 kB/sec
	Avg throughput per process 			=  537308.31 kB/sec
	Min xfer 					= 16777216.00 kB

	Children see throughput for  1 readers 		=  710123.75 kB/sec
	Parent sees throughput for  1 readers 		=  710108.56 kB/sec
	Min throughput per process 			=  710123.75 kB/sec 
	Max throughput per process 			=  710123.75 kB/sec
	Avg throughput per process 			=  710123.75 kB/sec
	Min xfer 					= 16777216.00 kB

	Children see throughput for 1 re-readers 	=  709986.50 kB/sec
	Parent sees throughput for 1 re-readers 	=  709970.87 kB/sec
	Min throughput per process 			=  709986.50 kB/sec 
	Max throughput per process 			=  709986.50 kB/sec
	Avg throughput per process 			=  709986.50 kB/sec
	Min xfer 					= 16777216.00 kB

# oops, runs out of space trying to run the 20 thread test
# only 90GB free on the boot drive...

One more iozone test, pulled from this benchmark. The ZFS volume:


	Using minimum file size of 131072 kilobytes.
	Using maximum file size of 1048576 kilobytes.
	Record Size 16 kB
	OPS Mode. Output is in operations per second.
	Auto Mode
	Command line used: iozone -n 128M -g 1G -r 16 -O -a C 1
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 kBytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
                                                              random    random     bkwd    record    stride                                    
              kB  reclen    write  rewrite    read    reread    read     write     read   rewrite      read   fwrite frewrite    fread  freread
          131072      16   107383   153753   454176   497752   402551   100747   337201    228240    143842   149257   107891   243222   227732
          262144      16   114165   134997   194843   209168    62298    78727   166649    227852     47554   120907   121755   206258   208830
          524288      16   108586   130493   228032   235020    53501    48555   190495    224892     45273   110338   113536   229965   205326
         1048576      16    83337    94119   203190   231459    46765    34392   180697    230120     44962    92476   112578   198107   230100

And the boot NVMe SSD:


	Using minimum file size of 131072 kilobytes.
	Using maximum file size of 1048576 kilobytes.
	Record Size 16 kB
	OPS Mode. Output is in operations per second.
	Auto Mode
	Command line used: iozone -n 128M -g 1G -r 16 -O -a C 1
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 kBytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
                                                              random    random     bkwd    record    stride                                    
              kB  reclen    write  rewrite    read    reread    read     write     read   rewrite      read   fwrite frewrite    fread  freread
          131072      16   124072   218418   407623   392939   462018   211145   392298    520162    435927   218552   206769   411371   453720
          262144      16   125471   236933   454936   427993   449000   212884   423337    525110    452045   229310   221575   451959   494413
          524288      16   123998   252096   520458   459482   511823   229332   496952    526485    509769   243921   239714   519689   547162
         1048576      16   125236   266330   562313   480948   476196   220034   498221    529250    471102   249651   247500   560203   571394

And then one more quick comparison using bonnie++. The ZFS volume:

# bonnie++ -u root -r 1024 -s 16384 -d /z -f -b -n 1 -c 4
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
z               16G           1084521  82 933471  88           2601777  99  1757  52
Latency                         103ms     103ms              3469us   92404us
Version  1.97       ------Sequential Create------ --------Random Create--------
z                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1   125   1 +++++ +++   117   1   124   1 +++++ +++   120   1
Latency               111ms       4us   70117us   49036us       5us   43939us

And the boot NVMe SSD:


Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
z               16G           118358   7 49294   3           392088  10  5793  63
Latency                         178ms    1121ms              1057ms   32875us
Version  1.97       ------Sequential Create------ --------Random Create--------
z                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  1  2042   3 +++++ +++   857   1   767   1 +++++ +++   839   1
Latency              2135us     106us   39244us   10252us       5us   35507us

Now granted, this is a cheap/slow NVMe SSD (I have a 512GB 970 Pro in a box here, but I’m too lazy/don’t care enough to reinstall on that to test), but the ZFS results surprised me (though a good chunk of that is presumably the ARC and RAM caching at work). Makes you wonder whether an array of enterprise SAS SSDs would beat out, say, those PCIe SSD cards, but I don’t get revved up enough about storage speeds to really do more than pose the question. I may do a bit more reading on tuning, but I’m basically limited by my USB and network connection (a single Intel I211-AT at 1Gbps) anyway. Next steps will be centralizing all my data, indexing, deduping, and making sure that I have all my backups sorted. (I may have some files that aren’t backed up, but that’s outweighed by many, many files that I probably have 3-4 copies of…)
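
On the deduping front, I mean file-level duplicates of my own data, not ZFS dedup (which I’m leaving off). Something like fdupes is probably where I’ll start – the path here is just a hypothetical example:

# fdupes -r /z/archive > /tmp/dupes.txt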

Oh, and for those looking to build something like this (again, I’d reiterate: don’t buy a Ryzen APU if you plan on running Linux and value your sanity), here’s the final worksheet, which includes the replaced parts I bought while putting this little monster together (interesting note: $/GB has not gone down across my storage builds over the past few years):

Misc notes:

  • If you boot with a USB drive attached, it’ll come up mapped as /dev/sda – NBD if you’re mapping your ZFS devices properly (see the note after this list)
  • Bootup takes about 1 minute – about 45s of that is spent in the controller card’s BIOS
  • I replaced the AMD stock cooler w/ an NH-L9a so it could fit into the NSC-810 NAS enclosure, but obviously, that isn’t needed if you’re just going to leave your parts out in the open (I use nylon M3 spacers and shelf liners to keep from shorting anything out since I deal with a lot of bare hardware these days)
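
Regarding that first note, the reason device-name shuffling is a non-issue is that the pool should be referencing stable /dev/disk/by-id paths rather than bare /dev/sdX names. If zpool status shows sdX devices, re-importing by-id fixes it – a quick sketch:

# zpool status z
# zpool export z
# zpool import -d /dev/disk/by-id z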

2018-08-30 UPDATE: I’ve been running the NAS for a month, and while it was more of an adventure than I would have liked to set up, it’s been performing well. Since there’s no ECC support for Raven Ridge on any motherboards at the moment, I RMA’d my launch-era Ryzen 7 1700 (consistent segfaults when running ryzen-test, but I never bothered to swap it since I didn’t want the downtime and I wasn’t running anything mission critical) so I could swap that into the NAS and get ECC support. This took a few emails (AMD is well aware of the issue) and about two weeks to do the swap. Once I got the CPU back, setup was pretty straightforward – the only issue was that I was expecting the wifi card to use a mini-PCIe slot, but it uses a B-keyed M.2 instead, so I’m running my server completely headless ATM until I bother to find an adapter. (I know I have one somewhere…)

# journalctl -b | grep EDAC
Aug 30 08:00:28 z kernel: EDAC MC: Ver: 3.0.0
Aug 30 08:00:28 z kernel: EDAC amd64: Node 0: DRAM ECC enabled.
Aug 30 08:00:28 z kernel: EDAC amd64: F17h detected (node 0).
Aug 30 08:00:28 z kernel: EDAC MC: UMC0 chip selects:
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 0:  8192MB 1:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 2:     0MB 3:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 4:     0MB 5:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 6:     0MB 7:     0MB
Aug 30 08:00:28 z kernel: EDAC MC: UMC1 chip selects:
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 0:  8192MB 1:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 2:     0MB 3:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 4:     0MB 5:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: MC: 6:     0MB 7:     0MB
Aug 30 08:00:28 z kernel: EDAC amd64: using x8 syndromes.
Aug 30 08:00:28 z kernel: EDAC amd64: MCT channel count: 2
Aug 30 08:00:28 z kernel: EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
Aug 30 08:00:28 z kernel: EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
Aug 30 08:00:28 z kernel: AMD64 EDAC driver v3.5.0
Aug 30 08:00:32 z systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Aug 30 08:00:32 z systemd[1]: Started Initialize EDAC v3.0.0 Drivers For Machine Hardware.

# edac-ctl --status
edac-ctl: drivers are loaded.
# edac-ctl --mainboard
edac-ctl: mainboard: ASRock AB350 Gaming-ITX/ac

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
edac-util: No errors to report.