Dealing in Terabytes of Image Data
The rapid pace at which I acquire data, somewhere between 1 and 2TB a year, means that I’ve cycled through many a storage setup over the last few years. At one point, my entire photo library fit on a single drive. I’d expect this is true for most photographers in this day and age, when 2TB drives can be had for ridiculously low prices.
When all your data fits on a single drive, the strategy is easy. A primary drive for your active library. A secondary drive for the backup that stays at home. And a third drive to make a backup that lives somewhere else, such as your safe deposit box. Easy. Clean. Simple. If you fall into this category, count your lucky stars. The only reason to read onward is for entertainment value.

When you exceed the amount of data you can comfortably put on a single disk, it all gets way too complicated. Quickly. And it’s not like you can jam 2TB of data on a 2TB disk. No no no no nooooo. You really don’t want to run a filesystem more than 75-80% full. So, really, you have to bite the bullet at 1.5TB if you’re using 2TB disks.
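To put numbers on that rule of thumb, here is a rough Python sketch. The function names and the 75% default are purely illustrative, and it treats a "2TB" drive as a flat 2.0 without worrying about formatting overhead or decimal-versus-binary units.

```python
import math


def usable_tb(drive_tb: float, max_fill: float = 0.75) -> float:
    """Data you can comfortably keep on one drive at a given fill ceiling."""
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in (0, 1]")
    return drive_tb * max_fill


def drives_needed(data_tb: float, drive_tb: float, max_fill: float = 0.75) -> int:
    """Smallest number of equal-size drives that holds data_tb under the ceiling."""
    return math.ceil(data_tb / usable_tb(drive_tb, max_fill))


if __name__ == "__main__":
    print(usable_tb(2.0))            # comfortable capacity of one 2TB drive
    print(drives_needed(1.6, 2.0))   # just past the ceiling means a second drive
```

Multiply the drive count by three (primary, onsite backup, offsite backup) and it becomes clear how fast the pile of disks grows.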
A few years ago, when I exceeded what I could comfortably put on a 750GB hard drive (the biggest that you could reasonably get your mitts on at the time), I decided to stay away from anything too terribly complex and to continue on the same basic idea. Times two. Everything before 2006 on one drive, everything after on another. That meant two onsite and two offsite backup drives. A bit bulky, but not too bad. It seemed like it would be fairly simple to linearly scale this approach.
Except that it’s not quite linear if you want to maintain a nice YYYY/MM/DD folder structure. To do that and stay within a 75-80% cushion means that you give up even more space. Luckily, hard drive capacities quickly inflated and I simply upgraded my 2 primary + 2 onsite backup + 2 offsite backup strategy from 750GB disks to 1.5TB disks and finally 2TB.
My data acquisition rate, however, ended up outrunning the ability of Seagate, Hitachi, or Western Digital to keep up. I ended up going to 3+3+3 drives late last year. By the middle of this year, thanks in part to my dabbling in video, I could see 4+4+4 on the horizon. And really, 5+5+5 was going to happen in 2011, thanks to the lost space that the 25-35% buffer demands. It was clear that it was time to do what I hadn’t wanted to do all these years: Go RAID.
I can clearly hear some of you dear readers shouting at me right about now. “Delete more photos! Surely you don’t need to keep all of them!?!?!?” Damn. I wish that helped. I do, in point of fact, delete obvious dingers. I even delete some of the ones that aren’t obvious. If I didn’t do that, you don’t even want to know how much data I’d have on hand. I’d guess it’d be past 8TB by now. Seriously.
So, RAID. Let’s get one thing out of the way right now. RAID isn’t magic. It’s not backup. And each kind of RAID has its own consequences. Even today’s spiffy Beyond RAID and RAID-X devices that purport to make things better don’t solve it all. RAID, as we know it, is a power tool. It solves some problems fairly well, but gives you the ability to screw things up just as fast and with tons more data than ever before if you mess up. I’ve worked with big-ass RAID setups in data centers and that’s precisely why I didn’t want to go there at home all these years.
A quick aside: In a perfect world where Sun hadn’t gotten weird, maybe ZFS would have saved us by now at the workstation level like it has many a sysadmin in the server farms. But something happened with legal types and Apple bugged out of ZFS and now we’re stuck waiting to see if something better comes out of the Infinite Loop. Then, Sun got bought by Oracle which is even weirder, as Google is finding out. It’s probably a good thing Apple got cold feet.
Anyway, it’s 2010. 2TB hard drives are the norm. And if you want to stack a bunch of ‘em together into a single large volume, you’re looking at RAID in one form or another. If you condense all the gobbledygook and mumbo-jumbo and cut through a lot of the crap, I personally think it boils down to the following options, at least at the SOHO level:
- You can glue disks together using RAID 0. You can put 4 2TB drives together, for example, and you get an 8TB volume. With most disks, you get a helluva speed pickup at the same time. It’s not quite linear, but 400MB/s isn’t uncommon for a 4 disk array these days if each disk is on its own SATA channel. If you’re editing video, you need this kind of storage performance. Even with photos, every little bit can be nice. Of course, one disk goes south in your array and all the data vaporizes. Instantly.
- You can moosh together disks in a RAID5/RAIDX/Beyond RAID form that does parity checking and sometimes volume expansion. This lets a disk (or two in some configurations) blow out and you don’t have to worry. Just replace the bad disk and go on with life. You do need hardware to do this well, however, and a lot of the consumer and SOHO hardware isn’t all that fast. Drobos, for example, are awesomely flexible when it comes to tossing disks of various sizes at them, but they’re slow. At least they are slow to me. 20-40MB/s is not very quick these days and I’ve gotten tired of waiting on backup syncs to my current Drobo. There are faster options out there, but if you’re going to go there, it quickly makes sense to look at the next level.
- You can get a network server, also known as a NAS, which uses a RAID5/RAIDX/BeyondRAID solution. This is data center thinking that’s now being applied to the SOHO market. The first iterations of NAS devices for mere mortals were pretty pokey, but that’s now changing. Some boxes can now saturate GigE networks and give you 100MB/s. If you get fancy with network topology and buy the expensive switches, you can see much more than that.
- Last, if you’ve got a big budget, you can buy some Xserves, some Promise RAIDs, and go nuts. But that’s simply a more expensive version of the last point, and since I don’t have that budget, it’s moot as far as I’m concerned.
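To get a feel for what those throughput numbers mean in practice, here is a back-of-the-napkin Python sketch. It assumes the rate is sustained the whole way through and uses decimal units (1TB = 1,000,000MB); the function name is made up for illustration, and real transfers with lots of small files will run slower.

```python
def transfer_hours(data_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move data_tb terabytes at a sustained rate in MB/s."""
    if rate_mb_s <= 0:
        raise ValueError("rate_mb_s must be positive")
    megabytes = data_tb * 1_000_000  # decimal terabytes to megabytes
    return megabytes / rate_mb_s / 3600


if __name__ == "__main__":
    # Rates quoted in the list above: slow Drobo, fast Drobo, GigE NAS, 4-disk RAID 0.
    for rate in (20, 40, 100, 400):
        print(f"{rate:>3} MB/s: {transfer_hours(8, rate):.1f} hours for 8TB")
```

A full 8TB copy works out to about four and a half days at 20MB/s, roughly 22 hours at 100MB/s, and under six hours at 400MB/s. Incremental syncs move far less data, but the first full copy is where the slow boxes really hurt.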
Given that, I pondered for a while. Weeks. Months, it felt like. Not full time, mind you. But, every time I was on an airplane, I doodled on napkins trying to decide what I was going to do to balance out the pros and cons of each approach available. This week, up against a wall with data storage and needing to do something about it, I finally decided that I would pursue a strategy that incorporated the best of two different RAID approaches, giving speed where I wanted it most and safety where I needed it.

For my local working dataset, where I feel the need for speed, I built an 8TB RAID 0 out of 4 Hitachi 7K2000 2TB drives in a FirmTek enclosure. Lloyd Chambers has benchmarked this exact array running at a peak of 500MB/s when installed inside a Mac Pro. At 50% full, it runs north of 400MB/s. My array won’t run this fast in my enclosure, but that’s because I bought a port-multiplied enclosure a while back that was more optimized for my previous strategy. The theoretical max for my current enclosure should be a bit under 300MB/s. Not shabby. And if I need the speed the drives can give, I can change enclosures and SATA cards.
Next, for on-site backup purposes, I picked up a six-bay ReadyNAS Pro Pioneer and stuffed it with a bunch of 2TB Seagate drives that I’ve been accumulating. I’ve set it up for dual disk redundancy. This means that it can weather having two disks die. With a full set of 6 disks, it’s the perfect size to back up my 8TB working array over the network. Even better, the 2TB drives I put into it are from a wide range of lots, as I’ve been picking them up in ones and twos from different sources over the last six months or so. This helps increase the odds that there won’t be a double or triple disk failure.
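The reason six disks line up so neatly with a four-disk stripe comes down to simple arithmetic, sketched below in Python. This assumes equal-size drives and that dual redundancy costs two drives' worth of capacity, the way RAID 6 does; the function names are hypothetical, and the figures are raw capacity before any filesystem overhead.

```python
def raid0_capacity_tb(n_drives: int, drive_tb: float) -> float:
    """Striping: every drive contributes its full capacity, with no redundancy."""
    return n_drives * drive_tb


def parity_capacity_tb(n_drives: int, drive_tb: float, parity_drives: int) -> float:
    """Parity array: capacity of all drives minus those given over to redundancy."""
    if n_drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (n_drives - parity_drives) * drive_tb


if __name__ == "__main__":
    print(raid0_capacity_tb(4, 2.0))      # the working array
    print(parity_capacity_tb(6, 2.0, 2))  # the NAS with dual redundancy
```

Both come out to the same raw capacity, which is why the backup box can hold everything the working array can.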
Once implemented, instead of a bunch of volumes all over my Finder, this will give me one big volume on my desktop. Then, there’s a big volume out on the network to be the safe copy. Better yet, thanks to the fact that the ReadyNAS has an rsync server, I don’t even have to mount the drive via the Finder. I can just launch a script to sync things up in the background and keep my desktop nice and tidy. Finally, the ReadyNAS also has a feature that scans all the data on the drives every week and checks for bit rot.
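A background sync script along those lines might look something like this Python sketch. The volume path, hostname, and rsync module name are placeholders rather than real settings, and the script assumes the NAS exposes a standard rsync daemon module. Note that `--delete` removes files from the backup that no longer exist on the source, which is why the sketch defaults to a dry run.

```python
import subprocess


def build_rsync_command(source_dir: str, host: str, module: str,
                        dry_run: bool = True) -> list[str]:
    """Build an rsync command that mirrors source_dir to an rsync daemon module."""
    src = source_dir.rstrip("/") + "/"   # trailing slash: copy contents, not the folder itself
    dest = f"{host}::{module}/"          # host::module is rsync's daemon syntax
    cmd = ["rsync", "--archive", "--delete", "--verbose"]
    if dry_run:
        cmd.append("--dry-run")          # report what would change without touching anything
    return cmd + [src, dest]


def run_backup(source_dir: str, host: str, module: str, dry_run: bool = True) -> None:
    """Run the sync and raise CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, host, module, dry_run), check=True)


if __name__ == "__main__":
    # Placeholder values; substitute your own volume, NAS hostname, and module.
    print(" ".join(build_rsync_command("/Volumes/Photos", "readynas.local", "backup")))
```

Once the dry-run output looks right, calling `run_backup(..., dry_run=False)` from a scheduled job would do the real sync.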
Inevitably, at some point, I’ll need to expand again. I figure I’ll run into the wall again late next year at current rates. I hope that 3TB drives will be common at that point and that I can move from 8TB to 12TB in a nice smooth move. If, on the other hand, data piles up faster than that, it’s almost certainly going to be because of video. In that case, the solution will be straightforward. I’ll repeat the double-down strategy, but split photos and video apart into their own local and remote arrays.
For the time being, however, I’ve got working room again. And a sane plan that gets me a bit further down the road. At least a year or 18 months. Maybe a bit more.
“Wait a minute,” you say. “What about those offsite backups?” Well, dear reader, for now those are going to remain on portable external drives that I cart back and forth to the bank. It’s not a great solution, but it works acceptably well. I still hope to replace this with a better solution someday. The ReadyNAS boxes have the ability to sync with another ReadyNAS in a remote location. That could be an interesting thing to investigate if I end up liking the box I have here now.