Posted by Scott Laird 395 days ago
I now have Solaris up and running and reasonably stable-looking, after only 12 hours of work. A number of things turned out to be bigger issues than I’d anticipated, largely because it’s been years since I last used Solaris and, frankly, Solaris’s disk partitioning and formatting tools suck.
- My first problem is still unresolved: my BIOS refuses to boot from the IDE DVD drive that I installed. Once the system boots, it works just fine, so I’m not sure what’s up. Maybe a BIOS bug. Fortunately, the system’s perfectly happy booting off a USB DVD drive, and (amazingly) Solaris is happy installing from it.
- The GC-RAMDISK card that I was looking forward to testing is a complete failure so far. I don’t know if I have a bad card or if it’s simply incompatible with both SATA chips in the system, but the BIOS completely ignores it if it’s plugged into the motherboard, and Solaris fails to talk to it on either bus. If it’s plugged into the MB, then I get a “device failed initialization” error; if it’s plugged into the PCI-X SATA card then I get “device on port 5 still busy after reset”. I’ve swapped cables and RAM. I’d really like to get it to work, so I’m going to try it with an older system before RMAing it.
- Actually getting a working Solaris install took me 3 tries. The first time I installed it to the wrong drive (the first disk on the PCI-X card, not the first disk on the motherboard), and it was unable to mount the root partition after rebooting. Next, I managed to install it onto an EFI partition, and that wouldn’t boot either. Finally, I installed it onto the second drive on the right bus, and that worked.
- Since Solaris’s installer doesn’t support ZFS yet, I had to manually copy the root filesystem onto a newly created ZFS filesystem mirrored across a pair of drives. The directions are helpful, but I kept screwing things up. First, I accidentally created the ZFS pool using the entire disk, which made ZFS re-label the drives with EFI, which makes them unbootable. Then, I missed the line in the directions that says to run format -e instead of using format; that left me with a pair of nicely partitioned drives that still used EFI. The third try worked, and the system is now booting off of ZFS via GRUB without problems. Er…
- Well, one problem–I can’t change the GRUB menu.lst file for some reason. I don’t know where GRUB is looking for it, but it’s not in /boot/grub/menu.lst on my boot array. My changes are being completely ignored. I can live with this for the weekend.
- OpenSolaris doesn’t ship with drivers for the ASUS P5K WS’s onboard Ethernet chips. I had to grab them from Marvell, but it was easy enough to install.
- Creating an 8-drive ZFS filesystem is trivial. One command sets up RAID and logical volume management, creates the filesystem, and mounts it:
zpool create -f space raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0
- ZFS performance is decent. Here’s a bonnie with and without ZFS compression, using 10 GB of data on a box with 2 GB of RAM:
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine  GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec %CPU
zfs      10 105.0 55.3 163.3 27.3 121.0 30.4 119.2 88.4 287.1 36.2   169  1.8
zfs+c    10 112.9 59.7 181.5 30.3 127.8 29.1 118.1 86.0 424.9 52.2   198  2.1
- 163 MB/sec writing and 287 MB/sec reading is good enough for me. I was expecting slightly higher numbers, but there’s nothing here to complain about. Adding compression improves writing a bit and makes a big difference reading. It’s quite a bit faster than GigE, which was my goal.
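For anyone who wants the comparison spelled out, here is a small shell sketch that works out the compression gains and the margin over Gigabit Ethernet from the block figures in the bonnie table. The function names are made up for this example, and the 125 MB/sec figure is only the theoretical ceiling of a 1 Gbit/sec link (1000 Mbit/sec divided by 8), so real-world GigE throughput will be somewhat lower.

```shell
#!/bin/sh
# Sketch only: arithmetic on the bonnie block read/write figures above.

# Theoretical ceiling of a 1 Gbit/sec link in MB/sec (an assumption:
# it ignores protocol overhead).
GIGE_MB_S=125

# Percentage change from a baseline figure to a new figure.
pct_gain() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# How many times faster than the GigE ceiling a throughput figure is.
vs_gige() {
    awk -v mb="$1" -v gige="$GIGE_MB_S" \
        'BEGIN { printf "%.1f\n", mb / gige }'
}

echo "block write gain with compression: $(pct_gain 163.3 181.5)%"
echo "block read gain with compression:  $(pct_gain 287.1 424.9)%"
echo "uncompressed block write vs GigE:  $(vs_gige 163.3)x"
echo "uncompressed block read vs GigE:   $(vs_gige 287.1)x"
```

Compression is worth roughly 11% on block writes and 48% on block reads, and even the uncompressed write figure sits above what a single GigE link can carry.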
Posted by Scott Laird 396 days ago
A few days ago, I mentioned that my home NAS box had failed, and that I was considering replacing it with a PC server running OpenSolaris and ZFS. I’ve read a pile of ZFS docs, and it looks like the best option available to me today, so I decided to order some suitable hardware.
At that point, pretty much everything broke down. I have a hard enough time keeping track of which hardware works with Linux this week, and OpenSolaris is completely new to me. Sun’s list of officially-supported hardware is pretty sparse, and digging through their mailing list archives gets frustrating quickly. From what I can tell, it boils down to:
- Current Intel and AMD CPUs are all fine.
- Most of Intel’s chipsets are fine.
- Most of nVidia’s AMD chipsets are fine.
- nVidia and Intel video chips are good.
- Most common Ethernet chipsets are either supported natively or have drivers available.
- The only SATA controllers that work are Intel’s ICH southbridges, Silicon Image’s PCI and PCI-E chips, Marvell’s PCI chips, and nVidia’s southbridges. It’s not clear that Marvell’s PCI-E chips work. Most motherboards with additional, non-southbridge SATA ports probably won’t work.
- Venturing too far outside of this list will probably result in problems.
I was looking for a motherboard with 8 SATA ports, and was hoping that the Intel D975XBX2 (“Bad Axe 2”) would work, but 4 of its 8 SATA ports belong to a Marvell PCI-E SATA chip that doesn’t appear to be supported. I went through every single 8-port motherboard in Newegg’s catalog and ended up with the ASUS P5K WS instead (the ‘WS’ is important–the P5K is a different board). It only has 6 on-board SATA ports, but it includes a PCI-X slot. That’ll let me use the Supermicro AOC-SAT2-MV8, which is far and away the cheapest 8-port SATA card on the market. That’ll give me a total of 14 SATA ports, which should be enough for whatever I want to throw at it. The Marvell PCI-X chip at the heart of the Supermicro card is the same one used in Sun’s Sun Fire x4500 48-drive server, so it’s safe to assume that Sun has put a lot of effort into the driver.
Most of the rest of the system is fairly generic–a cheap nVidia 7200GS video card (the cheapest PCI-E card that Newegg carries), a nice case and power supply, RAM, and a boatload of drives.
The one odd component that I’ve added is a Gigabyte GC-RAMDISK with 1 GB of RAM. The GC-RAMDISK is a battery-backed SATA ramdisk; it looks like a hard drive to the system and can survive up to 18 hours without power. I’ve had my eye on this thing for years, and it looks like it’ll be a perfect external log device for ZFS. I had to ask how ZFS will behave if the device fails, and it looks like manual intervention may be required after an 18+ hour power outage, but it should be pretty minimal. I’m planning on posting some benchmarks here once I’ve had a chance to try it out.
Assuming that I’m able to get this whole mess to work at all, I should have lots to write about here over the next week or so. I’m going to start by explaining why I want to use Solaris instead of Linux or *BSD, and why I’m building something instead of buying a pre-built NAS box.
Posted by Scott Laird 397 days ago
So, as part of my new home server series, I want to explain why I’m using OpenSolaris instead of Linux.
I’ve used Linux since 0.97.1, in August of 1992. I’ve had at least one Linux box at home continuously since 1993 or so. I’ve had a few small chunks of my code added to the kernel over the years. I’ve built several install disks and one embedded appliance distro from scratch, starting with a kernel and busybox and going on up from there. I’ve written X drivers, camera drivers, and drivers for embedded devices on the motherboard. I’ve managed Great Heaping Big Gobs of Hardware at various jobs. Basically, I know Linux well, and I’ve used it for almost half of my life.
That in itself might mean that it’s time for a change–professionally, I’ve been very tightly focused on Linux, and diversity is a good thing. But that’s not why I’m using Solaris this week. I’m using it because I’m fed up with losing data to weird RAID issues with Linux, and I believe that OpenSolaris with ZFS will be substantially more reliable long-term. Things I’m specifically fed up with:
- md (the Linux RAID driver)’s response to any sort of drive error, even a transient timeout, is to kick the drive from the array, no matter what. Most of the IDE drives that I’ve had over the years have been prone to random timeouts every few months, at least once you bundle more than 2 or 3 of them in a single box and then try snaking massive ribbon cable through the case. My SATA experiences haven’t been substantially better. Linux will happily bump an otherwise working 4-drive RAID 5 array to a 3-drive degraded RAID 5 array on the first failure, and then on to a 2-drive failed array on the second failure, even when a simple retry would have cleared both errors. This has cost me data repeatedly, because I’ve been forced to manually intervene and re-add “failed” disks to RAID arrays. If I was too slow, then a second drive failure risked total data loss. Even worse, these random transient failures blind you to real drive failures, like the one that ate my NAS box last weekend.
- Actual drive failures can hang the kernel. I’ve had at least 3 cases at home where broken drives either caused system lockups or completely kept the system from booting. That sucks. Odds are some drivers are good while others are broken; apparently I’ve just had bad luck.
- None of Linux’s filesystems are particularly resilient in the face of on-disk data corruption. Compare with ZFS, which checksums everything that it reads or writes.
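To make the checksum point concrete, here is a minimal shell sketch of the general idea: record a checksum when a block is written, then compare it again on every read. It uses the standard cksum utility (a CRC) purely as a stand-in. ZFS has its own checksum algorithms and on-disk layout, and the function names below are hypothetical.

```shell
#!/bin/sh
# Illustration only: cksum (a CRC) stands in for a filesystem-level checksum.

# Print the checksum of a file's contents.
block_sum() {
    cksum < "$1" | awk '{ print $1 }'
}

# Succeed only if the file still matches the checksum recorded earlier.
verify_block() {
    [ "$(block_sum "$1")" = "$2" ]
}

block=$(mktemp)
printf 'family photos\n' > "$block"
stored=$(block_sum "$block")           # recorded at write time

verify_block "$block" "$stored" && echo "clean read: checksum matches"

printf 'family phot0s\n' > "$block"    # simulate silent on-disk corruption
verify_block "$block" "$stored" || echo "corruption detected: checksum mismatch"

rm -f "$block"
```

A filesystem that never records the checksum has nothing to compare against, so the corrupted block would simply be handed back to the application as if it were good.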
In short: everything works great when things are perfect, but building a reliable multi-drive storage system requires careful component and kernel compatibility work, and then you have to stay right on top of things if you want everything to keep working. When things stop working, they usually fail badly. That’s almost the complete antithesis of what I want for home: plug it in, and it just keeps working. I don’t want small failures to cascade through the system. Little failures should be isolated, identified, and automatically repaired whenever possible. OpenSolaris and ZFS seem to provide that, while Linux with md and ext3 does not.
That’s why I’m planning on using ZFS. My logic for building a server vs. buying another little NAS box is simple: none of the little NAS boxes on the market use ZFS right now, and none of the cheap ones have room for more than 5 drives. I’m planning on using a double-parity system (RAID 6 or ZFS’s raidz2, where the system can cope with a 2-drive failure) plus a spare drive, and that’d only leave me with 2 data disks. The only way that I can get enough space with only 2 disks would be to use 1TB drives, and they’re too pricy right now.
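The drive-count arithmetic can be sketched in a few lines of shell. This is a rough estimate only: it ignores filesystem overhead, treats 1 TB as 1000 GB, and the function names are invented for this example. The 8-drive line is there for comparison with a box that has more bays.

```shell
#!/bin/sh
# Sketch only: rough drive-count and capacity arithmetic.

# Drives left for data once parity and hot spares are set aside.
# Arguments: total drives, parity drives, spare drives.
data_drives() {
    echo $(( $1 - $2 - $3 ))
}

# Rough usable space in GB. Fourth argument is the per-drive size in GB.
usable_gb() {
    echo $(( $(data_drives "$1" "$2" "$3") * $4 ))
}

echo "5-bay NAS, double parity, 1 spare: $(data_drives 5 2 1) data drives"
echo "same box with 1 TB drives:         $(usable_gb 5 2 1 1000) GB usable"
echo "8 drives, double parity, no spare: $(data_drives 8 2 0) data drives"
```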
So, I’m willing to spend the time to build a somewhat complex server because I believe (hope?) that it’ll save me time in the future, and it’ll let me avoid ever having to do the reconstruct-from-the-source dance again. I don’t think I lost anything critical last weekend, and I’m reasonably confident that I’ll be able to get things limping along well enough to recover data anyway, but I’ve now done this 3 times in the past 4 years, and I’ve had it.
Coming up soon: backups, OpenSolaris hardware compatibility, and GC-RAMDISK performance benchmarks. Stay tuned :-).
Posted by Scott Laird 400 days ago
So, the comments on yesterday’s post about my nasty RAID failure encouraged me to spend some time looking at ZFS on OpenSolaris, and I really like what I see. I’ve ordered some new hardware, so I should have lots to write about by next weekend.
Reading the ZFS docs reminded me of my Holy Grail of Storage: a storage system that could actually do reasonably smart things with 3–5 drives. Imagine a system where you could start with 3 drives and simply plug new drives in as you need more space, without worrying about RAID or data layout. When you run out of slots, then just unplug the oldest, smallest drive and plug in a new, larger one, and the data will resync, giving you more disk space without needing any special work on your part. For bonus points, you’d be able to designate specific bits of your data as more or less important, so Bittorrent files might not be replicated at all, while your Word documents might be replicated onto every available drive.
I’ve wanted that for years, but I’ve largely dismissed it as a pipe dream, because it doesn’t fit cleanly into the drive/RAID/LVM/filesystem model that everything uses. The only thing that I’ve seen that even comes close is Drobo, and it’s supposedly fairly slow and really just too “magic” for me to trust.
I realized this morning that it’d be easy to build a storage system like this using ZFS. Just create a zpool with 3 drives to start, and then create zfs filesystems with copies=2 on top of it. When you add new drives, just add them to the pool. Blindly removing a single old drive will only leave you with a single copy of some of your files, but that shouldn’t be fatal, and ZFS can copy everything off of it if you give it a chance. There are some corner cases that will give you less redundancy–if you manage to fill the system 98% full before adding a new drive, then all of the replicas of new data will probably end up on the same disk. There are a couple obvious workarounds, and Sun will probably add replication rebalancing at some point, if it isn’t there already.
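The recipe above might look something like the following. This is an untested sketch, written as a dry run that only prints the commands: the pool name, filesystem names, and device names are all placeholders, and the helper functions are invented for this example. Drive removal is left out, since how the pool copes with a pulled drive is still speculation at this point.

```shell
#!/bin/sh
# Dry run: prints the ZFS commands instead of executing them.
# Pool, filesystem, and device names are placeholders.
POOL=tank

run() {
    # Replace the echo with "$@" to execute for real
    # (needs root and disks you can afford to wipe).
    echo "$@"
}

# Start with a plain pool built from the given drives.
create_pool() {
    run zpool create "$POOL" "$@"
}

# Create a filesystem that keeps the given number of copies of each block.
# Arguments: filesystem name, number of copies.
create_fs() {
    run zfs create -o copies="$2" "$POOL/$1"
}

# Grow the pool later by adding more drives.
grow_pool() {
    run zpool add "$POOL" "$@"
}

create_pool c1t0d0 c1t1d0 c1t2d0
create_fs documents 2    # important files: two copies of every block
create_fs torrents 1     # replaceable files: a single copy
grow_pool c1t3d0
```

Setting copies per filesystem is what gives the "more or less important" distinction: each filesystem in the pool can carry its own value.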