I'm far from an expert, but that doesn't make much sense to me. If the problem is errors accumulating during rebuilds, which are getting exponentially longer, then nested arrays should take care of that. I remember once working through the failure-rate calculations for a nested system versus an n-parity system. I can't recall the exact numbers, but I concluded that the nested system was better. I did not take rebuild time into consideration, however. If, as you said, drive sizes are increasing rapidly, then it seems the best thing to do is not use RAID 5/6 at the top level at all. The problem with a nested array is the hardware needed to run it: with more controllers, more things can go wrong. But if you duplex two nested arrays (let's say a large RAID 6 array of small RAID 5s, plus a spare RAID 5 array), wouldn't that take care of it? The odds of that many drives dying seem very slim, but the odds of enough drives plus a controller (which may be harder to detect in advance) failing seem like a real possibility.
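For what it's worth, the nested-vs-flat comparison above can be sketched with a simple binomial model. All the numbers below (per-disk failure probability, array sizes) are made up for illustration, failures are assumed independent, and rebuild time is ignored, just as in the original back-of-envelope calculation:

```python
from math import comb

def p_array_loss(n, parity, p_disk):
    """Probability that an n-disk array with `parity` parity disks is lost,
    i.e. more than `parity` of its disks fail (independent failures assumed)."""
    return sum(comb(n, f) * p_disk**f * (1 - p_disk)**(n - f)
               for f in range(parity + 1, n + 1))

p = 0.03  # illustrative per-disk failure probability over some time window

# Nested: a RAID 6 of six RAID 5 sub-arrays, five disks each (30 disks total).
p_sub = p_array_loss(5, 1, p)          # a RAID 5 leg is lost if >= 2 disks fail
p_nested = p_array_loss(6, 2, p_sub)   # the RAID 6 is lost if >= 3 legs fail

# Flat: one 30-disk array with triple parity, lost if >= 4 disks fail.
p_flat = p_array_loss(30, 3, p)

print(p_sub, p_nested, p_flat)
```

With these particular numbers the nested layout comes out ahead on loss probability, though it pays for that with far more capacity given over to parity, which is exactly the cost objection raised below.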
We've been through
nested RAID before, and you can probably add to that thread, but the inevitable conclusion is that with more nested levels it simply becomes cost-ineffective to dedicate entire RAID arrays to parity data. So it's not a scalable solution (or not a cost-effective one at scale, anyway).
As far as I can tell from the article, the 2 main issues are:
1) Throughput not increasing at the same scale as disk capacity
2) Scalable efficient n-level RAID algorithm not in existence yet
The first issue leads to longer RAID rebuild times (not errors accumulating during rebuild), and for sufficiently large arrays the rebuild time can stretch to days or even weeks. The article mainly attributes this to the low throughput of the drives relative to their capacity. Many degraded RAID arrays are also kept in service while they're being rebuilt (because for some enterprises the whole point of RAID is maximum uptime), and that does not leave much throughput available for the rebuild itself.
With rebuild times stretching that long, the chance of a second or even a third disk failing during the rebuild window increases as well, hence the growing need for triple-parity RAID. Extrapolating from that, a scalable n-level RAID algorithm (the second issue) would eliminate the need to write a new RAID algorithm each time an additional parity level is required.
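To put a rough number on "chance of another failure during the rebuild window", here's a minimal sketch assuming independent, exponentially distributed failure times. The MTBF and rebuild duration are illustrative placeholders, and real drives tend to fail in correlated batches (same age, same batch, same enclosure), so treat this as a lower bound:

```python
import math

def p_extra_failure(disks_remaining, rebuild_hours, mtbf_hours):
    """Probability that at least one more disk fails during the rebuild
    window, under an independent exponential failure model."""
    p_one_survives = math.exp(-rebuild_hours / mtbf_hours)
    return 1 - p_one_survives ** disks_remaining

# Illustrative only: 12-disk array with one disk already failed,
# a 24-hour rebuild, and a 100,000-hour per-drive MTBF.
print(p_extra_failure(11, 24, 100_000))
```

The key takeaway is that the risk scales linearly (to first order) with both rebuild duration and array width, so doubling disk capacity without doubling throughput roughly doubles this exposure.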
Additionally, there are other factors affecting rebuild times, such as time required for a human operator to procure a new disk and insert it into the array, but let's ignore those for now.
So at heart it's really a throughput problem: low throughput (relative to disk capacity) leads to longer rebuild times, longer rebuild times increase the risk of additional disks failing during a rebuild, and higher-level parity is then needed to offset that increased risk.
Or, to put things in perspective: it takes ~3.5 hours to fill a 2TB hard disk at a 150MB/s write speed (I assume you're all astute enough to correct me on the issues and not the numbers). In the near future we'll have 4TB disks, but disk read/write speeds are hardly doubling; a 4TB disk will take over 7 hours to fill, and so on.
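The arithmetic above, for anyone who wants to plug in their own numbers. This is the best-case sequential figure, ignoring seeks, parity computation, verification passes, and contention from production I/O:

```python
def fill_time_hours(capacity_tb, throughput_mb_s):
    """Best-case time to sequentially write a full disk."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB -> MB
    return capacity_mb / throughput_mb_s / 3600

for tb in (2, 4, 8):
    print(f"{tb} TB @ 150 MB/s: {fill_time_hours(tb, 150):.1f} h")
```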
But I don't find throughput issues that interesting, which is why I'm focusing on the second issue: scalable n-level RAID algorithms and their associated computational costs.
One thing I don't understand is why you can't just use something like par2 (an archive repair tool that uses Reed-Solomon coding), but with drives. It scales very well.
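To illustrate the "it scales" claim: the property Reed-Solomon coding gives you is that k data symbols plus m parity symbols survive any m erasures, for arbitrary m. This is not how par2 actually implements it (par2 works over GF(2^16)); below is a minimal toy sketch over the prime field GF(257) just to show the any-k-of-n recovery property:

```python
P = 257  # prime just above 255, so a byte value fits in one field element

def encode(data, m):
    """Encode k data symbols into n = k + m codeword symbols: symbol x is
    the data polynomial evaluated at x (mod P). Any k symbols recover data."""
    k = len(data)
    return [sum(d * pow(x, i, P) for i, d in enumerate(data)) % P
            for x in range(k + m)]

def decode(shares, k):
    """Recover the k data symbols from any k surviving (x, value) pairs by
    solving the Vandermonde system with Gauss-Jordan elimination mod P."""
    pts = shares[:k]
    # Augmented matrix rows: [x^0, x^1, ..., x^(k-1) | value]
    A = [[pow(x, i, P) for i in range(k)] + [v] for x, v in pts]
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col] % P)
        A[col], A[piv] = A[piv], A[col]
        inv = pow(A[col][col], P - 2, P)  # modular inverse via Fermat
        A[col] = [a * inv % P for a in A[col]]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * b) % P for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

data = [10, 20, 30, 40]            # four "disks" of one byte each
code = encode(data, 3)             # plus 3 parity symbols: survives 3 losses
survivors = [(x, code[x]) for x in (1, 3, 4, 6)]  # any 4 of the 7 symbols
print(decode(survivors, 4))        # -> [10, 20, 30, 40]
```

Adding one more parity level is just m -> m + 1 here, which is exactly the "scalable n-parity" property; the open question the article raises is doing this at disk-controller speeds, not whether the math exists.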
par2 helps repair corrupted data; it doesn't increase drive reliability or prevent the whole array from going down when a disk fails.