Let's Bid Adieu to Block Devices and SCSI
By Henry Newman
January 5, 2005
File System and Access Patterns

From what I have seen, in both databases and HPC, files are often read with skip increments. Most file systems, when reading data with buffered I/O, read ahead only if the addresses being read are sequential. Remember, just because an address is sequential on the file system does not mean that the addresses are sequential on the devices. Given how the latency gap between CPUs and storage devices has grown over the last 25 years, you need large I/O requests that allow the RAID device to read ahead sequential blocks (see Storage I/O and the Laws of Physics for more information on this issue). If you are reading with direct I/O (open with O_DIRECT), the issue is the same. File systems just do not issue readaheads except for sequential block addresses. The application could issue an asynchronous readahead, and for some databases this does happen, but file systems just do not have the intelligence built in to do this themselves.

Striping

Many file systems use a volume manager to stripe the data across all of the devices in the file system. This defeats any potential for sequential allocation of blocks on the individual disks in a LUN, and therefore defeats the readahead cache on the RAID. It should be noted that a number of file systems have added round-robin allocation (see the Physics article cited above for more details) as an additional allocation method. Most Linux volume managers, and file systems that are combined with volume managers, do not support round-robin allocation, which means that most Linux I/O will not use the RAID cache efficiently.

RAID Rebuild

If a disk within a RAID LUN goes bad, the RAID set must be rebuilt. With RAID-5, this means reading in the data from the remaining good disks and writing it out again to the same disks plus an additional hot spare. Take the example of a 2Gb FC RAID-5 8+1 with 146 GB drives and two 2Gb channels connecting the disks.
To rebuild, most RAIDs read a stripe in and then write the same stripe out, one stripe at a time. Therefore you will have:
Having 50 percent efficiency is certainly not unheard of, and more than three hours for a RAID rebuild is a long time to live with degraded application performance and exposure to another disk failure. Think about what happens with 400GB SATA drives and, instead of an 8+1, a 15+1, which is commonly used on some of the multimedia systems that I have worked on:
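The rebuild times for both configurations can be approximated with a short calculation. This is only a rough sketch, not the author's original figures. It assumes about 200 MB/s of usable bandwidth per 2Gb FC channel, decimal gigabytes (1 GB = 1000 MB), and a rebuild that reads every surviving disk in full and then rewrites those disks plus the hot spare, as described above. The function name `rebuild_hours` and its parameters are made up for illustration.

```python
def rebuild_hours(surviving_disks, drive_gb, channels=2,
                  channel_mb_s=200.0, efficiency=1.0):
    """Estimate RAID-5 rebuild time in hours.

    Assumptions (not from the article): ~200 MB/s per 2Gb FC channel,
    1 GB = 1000 MB, and all rebuild traffic shares the disk-side channels.
    """
    # Read every surviving disk in full.
    read_gb = surviving_disks * drive_gb
    # Write back to the same disks plus one hot spare.
    write_gb = (surviving_disks + 1) * drive_gb
    total_mb = (read_gb + write_gb) * 1000.0
    # Usable bandwidth after applying the efficiency factor.
    bandwidth_mb_s = channels * channel_mb_s * efficiency
    return total_mb / bandwidth_mb_s / 3600.0


if __name__ == "__main__":
    cases = [("8+1, 146 GB drives", 8, 146), ("15+1, 400 GB drives", 15, 400)]
    for label, disks, size_gb in cases:
        ideal = rebuild_hours(disks, size_gb)
        half = rebuild_hours(disks, size_gb, efficiency=0.5)
        print(f"{label}: {ideal:.1f} h at 100%, {half:.1f} h at 50% efficiency")
```

Under these assumptions the 8+1 case comes out at roughly 3.4 hours and the 15+1 case at roughly 17.2 hours at 50 percent efficiency, which is consistent with the figures quoted in the text.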
More than 17 hours of rebuild time at 50 percent efficiency is unacceptable. This problem is going to get worse, not better, over time.

Conclusions

I hope I have stated my case well enough that I won't get too much hate mail, but in my heart of hearts I believe the time has come for SCSI and block devices to be replaced. In the next article, we will cover the technology that I think will replace both of them and that, if adopted, could also replace file systems as we know them. All of this would be a good thing, in my opinion. That technology is called Object Storage Device, or OSD.

Feature courtesy of Enterprise Storage Forum.