The following is a combination of notes from Wes Brown’s presentation in June 2011, with edited details from his http://sqlserverio.com/ post on this subject, and my own notes and comments from the webinar. Wes, who approved this write-up, also provided PDF and PPT links for the presentation (I have cropped the images below for ease of reading):
http://sqlserverio.files.wordpress.com/2011/06/idera-sql-server-storage-and-you-part-3.pdf
http://sqlserverio.files.wordpress.com/2011/06/idera-sql-server-storage-and-you-part-3.pptx
Wesley Brown’s Background: DBA consultant from Dell. He is very active in the SQL Server community, serving on the PASS Summit program committee; he is coordinator of the Austin, TX PASS chapter (CACTUSS) and founder of the San Antonio PASS chapter.
How to Contact Master Brown :)
The presentation starts with a bit of advertising to sit through, but I like Idera’s products, so they have the right to some self-promotion here. I use their Admin Toolset practically every day. Moving on…
Solid state storage has burst onto the database scene in the last three years with amazing performance numbers and a price tag that brings it within reach. After covering the basics of NAND Flash, the foundation of solid state storage, you should be able to ask the right questions and evaluate whether solid state is the best option for your growing data needs.
Solid state is the new kid on the block. Wes warns that SSDs are not magic pixie dust that delivers immediate performance!
Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. Like with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Let’s review what solid state storage is and where it differs from traditional hard disks.
The presentation is a general and impartial coverage of SSD technology.
NAND is a suitable replacement for block-based storage hardware.
Flash Memory
Flash is a type of memory like the RAM in your computer, but there are several key differences. First, flash is non-volatile, meaning it doesn’t require electricity to maintain the data stored in it. It also has very fast access times: not quite RAM’s access times, but in between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of flash memory; the two most common types are NAND and NOR, and each has its benefits. NOR has the ability to write in place, with consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.
The Structure of NAND Flash
NAND stores data in a large serial array of transistors, each of which can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-levelling (the spare area at the end of each page is used for error correction). Usually, a page is 4096 bytes of data plus 1 to 4 bits of spare for each 512 bytes of data space. Pages are grouped again into blocks of 64 to 128 pages, which is the smallest erasable unit; thus, on an SSD, we only erase in blocks. There can be quite a few blocks per chip, as many as 16k blocks, or 8 GB worth, on a single chip. The parallel arrangement of chips behind grouped controllers is where we get our speed, and it makes them look like one large SSD. On an Intel 80GB X25 SSD I purchased last year, I opened it up to see 10x8GB chips, for example. Form factors are a standard 2.5” or 3.5” drive or a PCIe device.
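The geometry above multiplies out neatly. Here is a quick back-of-envelope sketch in Python using the figures quoted (4096-byte pages, the upper bound of 128 pages per block, 16k blocks per chip); actual geometry varies by part number, so treat these as illustrative values:

```python
# Back-of-envelope NAND geometry, using the figures quoted above.
# Real chips vary by part number; these values are illustrative.

PAGE_DATA_BYTES = 4096        # data area of one page (spare area excluded)
PAGES_PER_BLOCK = 128         # blocks hold 64-128 pages; take the upper figure
BLOCKS_PER_CHIP = 16 * 1024   # "as many as 16k blocks" per chip

block_bytes = PAGE_DATA_BYTES * PAGES_PER_BLOCK           # smallest erasable unit
chip_bytes = block_bytes * BLOCKS_PER_CHIP

print(f"erase block: {block_bytes // 1024} KB")           # 512 KB
print(f"chip capacity: {chip_bytes / 2**30:.0f} GB")      # 8 GB, matching the text

# Ten such chips in parallel is how an 80 GB drive like the X25 is built.
print(f"10-chip drive: {10 * chip_bytes / 2**30:.0f} GB") # 80 GB
```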
Multi-Level or Single-Level Cells, before we dig into more details:
A piece of NAND flash can only be written about 10,000 times, and its endurance gets worse over time.
SLC is harder to manufacture, so it is more expensive. I assume that over time, SLC will drop in price as demand grows.
NAND Read Properties
NAND Flash operates differently from RAM or a hard disk. Even though NAND is structured in pages like hard disks, this is where the similarities end. NAND is structured to be accessed serially; as a type of memory, NAND flash is a poor choice for random write access patterns. A 15,000 RPM hard disk may have a random access seek time of 5.5 milliseconds, since it has to spin a disk and position the read/write head. NAND, on the other hand, doesn’t actually seek: it does a lookup and reads the memory area, which takes between 25 and 50 nanoseconds. It has the same read time no matter the type of operation, random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.
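To put those two latencies side by side, a trivial sketch using the figures quoted in the talk (taking the worst-case 50 ns read):

```python
# Latency comparison using the figures quoted above.
HDD_SEEK_S = 5.5e-3    # 15k RPM disk random seek: 5.5 ms
NAND_READ_S = 50e-9    # NAND lookup/read: 25-50 ns; take the worst case

print(f"NAND read is ~{HDD_SEEK_S / NAND_READ_S:,.0f}x faster than a disk seek")
# -> ~110,000x: "orders of magnitude faster", with no penalty for random access
```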
Metrics: random seek time of ~50 nanoseconds.
Remember, SSDs are not hard disks, but chips, and there are no moving parts.
Fusion-io cards aggregate many NAND chips together; that’s how we get amazing performance.
Random access I/O is still a weak point, though, so RAM will not be replaced.
NAND Write Properties
NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in writes. Average page program time is ~250 nanoseconds, and that figure only covers programming a page. Writing to flash can be much more complicated if there is already data in the page.
NAND Flash does not support an over-write state.
A worry with NAND Flash is how it fails: it can no longer erase, but it is still readable. A flash disk, for example, can often still be read but not written any more… I have bought cheap flash thumb drives and seen this write failure after only a short period of use – beware cheap space, writable only a few times; you get what you pay for!
NAND is written in pages but erased in blocks – here’s the explanation why; it’s called the Program/Erase cycle.
NAND does writes based on a program/erase (P/E) cycle. When a NAND block is considered erased, all bits are set to 1. As you program, you set bits to 0. The program cycle writes a page at a time and can be pretty quick. NAND doesn’t support an overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk, which can overwrite data without first having to clear a sector. Erasing a block takes between 500 nanoseconds and 2 milliseconds. Each P/E cycle wears on the NAND block; after so many cycles the block becomes unreliable and will fail to program or erase (thus the issue mentioned above with cheap flash memory still being readable but not writeable).
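A minimal sketch of those P/E semantics: programming can only clear bits (1 to 0), so any rewrite that needs to flip a 0 back to 1 forces a block erase first. This is a toy model, not how a real controller is coded:

```python
# Minimal model of the NAND program/erase semantics described above.
ERASED = 0xFF  # an erased byte: all bits set to 1

def program(cell: int, data: int) -> int:
    """Programming can only clear bits (1 -> 0): effectively a bitwise AND."""
    return cell & data

def erase() -> int:
    """Erase resets every bit to 1 - but only at whole-block granularity."""
    return ERASED

cell = erase()                 # 0b11111111
cell = program(cell, 0b1010)   # fine: only clears bits
# Overwriting with 0b0101 would require setting bits 0 -> 1, which program()
# cannot do - the block must be erased first. That is the P/E cycle.
assert program(cell, 0b0101) != 0b0101
cell = erase()
cell = program(cell, 0b0101)   # now it works
```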
Wear-Levelling
To mitigate the finite number of P/E cycles a NAND chip has, we use two different techniques to keep chips alive and to make sure we don’t use a possibly bad block again. Take a single NAND MLC chip: it may have 16 thousand blocks on it, each rated for between 3,000 and 10,000 P/E cycles. If you execute one P/E cycle per second, spread evenly across all of the blocks, it would take you over five years to reach the 10,000-cycle wear-out rating on any block. If, on the other hand, you executed a P/E cycle on the same single block every second, you could hit the 10,000 rating in about 3 hours! This is why wear-levelling is so important. In the early days of NAND flash, wearing out a block was a legitimate concern, as applications would just rewrite the same block over and over. Modern devices spread the writes over not just a single chip but every available chip in the system, in other words extending the life of your solid state disk for a very, very long time. Ideally, you want to write to each block once before writing to any block a second time. That isn’t always possible due to data access patterns. The arithmetic behind those two figures is spelled out in the sketch below.
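The wear-out arithmetic from the paragraph above, verified in a few lines:

```python
# Wear-out arithmetic from the paragraph above.
BLOCKS = 16_000        # blocks on one MLC chip
PE_RATING = 10_000     # rated P/E cycles per block
RATE = 1               # one P/E cycle per second

# Perfectly wear-levelled: every block must wear out, 160M cycles in total.
levelled_s = BLOCKS * PE_RATING / RATE
print(f"wear-levelled: {levelled_s / (3600 * 24 * 365):.1f} years")  # ~5.1 years

# Hammering one block: only 10,000 cycles needed to kill it.
single_s = PE_RATING / RATE
print(f"single block:  {single_s / 3600:.1f} hours")                 # ~2.8 hours
```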
It sounds simple enough to cycle through all available blocks before triggering a P/E cycle, but in the real world it just isn’t that easy. As you fill the drive with data, it is generally broken into two categories: static data and dynamic data. Static data is written once, or infrequently, and read multiple times; a music file is a good example. Dynamic data covers things like log files that are written to frequently – or, in our case, database files. If you only wear-level the dynamic data, you shorten the life of the flash significantly. Alternatively, if you also include the static data, you are now incurring extra write and read IO in the background that can affect the performance of the device.
If I had an SSD and wrote one cycle per second spread across its blocks, the NAND would be worn out in about five years at that rate, so it is not something to obsess about. Concentrate the writes on one spot, though, and you could wear out a piece of NAND very quickly.
It all becomes very complicated – a single write can generate additional I/O for the reasons above.
Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase, we rely on garbage collection running in the background of the device. When a file is altered, it may be completely moved to clean pages and blocks, and the old blocks are marked as dirty. This tells the garbage collector that it can perform a block erasure on them at any time. This works just fine as long as the drive has enough spare area allocated and the number of write requests is low enough for the garbage collector to keep up. Keep in mind, this spare area isn’t visible to the operating system or the file system and is independent of them. If you run out of free pages to program, you start forcing a P/E cycle for each write, slowing down writes dramatically. Some manufacturers offset this with a large DRAM buffer, and some may allow you to change the size of the over-provisioned space.
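A toy model of the idea: a block is only reclaimable once none of its pages hold live data, and the collector erases such blocks in the background so future writes find pre-erased pages. Purely illustrative; real controllers also relocate live pages, which this sketch omits:

```python
# Toy model of the background garbage collection described above.
PAGES_PER_BLOCK = 4

class Block:
    def __init__(self):
        self.pages = ["free"] * PAGES_PER_BLOCK  # free -> live -> dirty

    def erasable(self) -> bool:
        # Erasing a block with live pages would destroy data.
        return all(p != "live" for p in self.pages)

    def erase(self):
        self.pages = ["free"] * PAGES_PER_BLOCK  # costs one P/E cycle

blocks = [Block(), Block()]
blocks[0].pages = ["live", "live", "dirty", "dirty"]  # partly live: not erasable
blocks[1].pages = ["dirty"] * PAGES_PER_BLOCK          # fully dirty: reclaimable

# The collector erases reclaimable blocks so writes don't pay the erase penalty.
for b in blocks:
    if b.erasable():
        b.erase()

print([b.pages for b in blocks])
```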
Write amplification – as a device moves blocks around, writes slow down due to the extra I/O overhead.
Write Amplification – another pitfall of wear-levelling and garbage collection is the phenomenon of write amplification. As the device tries to keep up with write requests and garbage collection, it can effectively bring everything to a standstill. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP-style databases mean lots of little inserts, updates and deletes, which adds to the problem. There may be enough free space to accommodate the write, but it is severely fragmented by the write pattern and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available, which also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor performer when it comes to writes, so do not let your SSDs fill up!
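Write amplification is usually expressed as a ratio: bytes physically written to flash versus bytes the host asked to write. A minimal sketch of the idea; the 8 KB page and 256 KB erase block figures are assumptions drawn from the surrounding discussion, not measured numbers:

```python
# Write amplification: flash bytes written vs. host bytes requested.
# Illustrative numbers, not measurements from the talk.

def write_amplification(host_bytes: float, flash_bytes: float) -> float:
    return flash_bytes / host_bytes

# A worst case: an 8 KB SQL Server page update landing in a fragmented
# region forces the controller to rewrite a whole 256 KB erase block.
wa = write_amplification(host_bytes=8 * 1024, flash_bytes=256 * 1024)
print(f"WA = {wa:.0f}x")  # 32x: each logical write costs 32x in flash wear

# Large sequential writes into clean blocks approach the ideal WA of 1.0.
```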
These three slides give a good explanation of what happens as pages are written to blocks, and how blocks are erased one at a time.
Garbage collection cycle.
TRIM
Another technology that has started to gain momentum is the TRIM command. Fundamentally, it allows the operating system and the storage device to communicate about how much free space the file system has, and lets the device use that space like the reserved, over-provisioned space used for garbage collection. The downside is that it is really only available in Windows 7 and Windows Server 2008 R2, though some manufacturers include a separate TRIM service for OSes that don’t support it natively. Also, TRIM can only be effective if there is enough free space on the file system; if you fill the drive to capacity, then TRIM is completely useless. Another thing to consider: an erasable block may be 256 KB, while we generally format our file systems for SQL Server at 64 KB, several times smaller than the erasable block. The last thing to remember – and it is good advice for any device, not just solid state storage – is to grow your files in large chunks to keep file fragmentation to a minimum. Heavy file fragmentation also cuts into TRIM’s effectiveness and can’t be easily fixed, since running a defragmentation may actually make the problem worse: it forces wholesale garbage collection and wears out the flash that much faster.
TRIM is OK with Windows 7 and Server 2008 R2. If you have filled up the drive, you have defeated TRIM, so do not fill up an SSD. Do not defrag an SSD either; defragmentation wears it out (oops, kicking myself quietly now) – let TRIM do its job.
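A small sketch of the size mismatch mentioned above, assuming the 256 KB erase block and 64 KB NTFS allocation unit quoted in the paragraph (both vary by device and format choice):

```python
# Size mismatch noted above: the file system allocation unit is several
# times smaller than the erasable block. Assumed, typical figures.
ERASE_BLOCK_KB = 256   # typical erasable block (varies by device)
CLUSTER_KB = 64        # common NTFS allocation unit for SQL Server volumes

print(f"{ERASE_BLOCK_KB // CLUSTER_KB} clusters per erase block")  # 4

# One live cluster can keep a whole erase block out of the garbage
# collector's hands, which is why large, contiguous file growth (and
# leaving free space for TRIM) matters.
```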
Error Detection and Correction
The nature of NAND Flash makes it susceptible to several types of data corruption. In the same way that hard drives and floppy disks are at risk near magnetic sources, NAND has several vulnerabilities, some of which occur even when just reading data.
Write Disturb
Data in cells that aren’t being written can be corrupted by writes to adjacent cells or even pages; this is called Program Disturb. Cells not being programmed receive elevated voltage, causing them to appear weakly programmed. There isn’t any damage to the physical structure, and the condition can be cleared with a normal erase.
Read Disturb
Reading repeatedly from the same block can have a similar effect, called Read Disturb. Cells not being read collect a charge that causes them to appear weakly programmed. The main difference from Write Disturb is that it always occurs on the block being read, and always on pages not being read. Again, the physical cell isn’t damaged, and an erase of the affected block clears the issue.
Charge Loss/Gain
Lastly, there is an issue with data retention in cells over time. A floating gate may gain or lose charge over time, making the cell appear weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to. All of this sounds about as catastrophic as it gets; fortunately, error correcting code (ECC) techniques deal with these issues effectively (especially on enterprise-level SSDs, where the ECC is very robust).
Bad Block Management
NAND chips aren’t always perfect. Every chip may have defects of some sort and will ship from the factory with bad blocks already on the device. Bad block management is integrated into the NAND chip: when a block fails a P/E cycle, the data is written to a different block and the failed block is marked bad, making it unavailable to the system.
Not all drives are created equal – make sure you read the specs of the drive you are buying! Benchmarks are often quoted at 4k blocks, whereas with SQL Server you should cut those figures in half because we use 8k pages. As stated in the slide detail, be very wary of short-stroking, queue depths and odd block transfer sizes, because SSDs are still a maturing technology. Often you will not find these specs on vendor sites; all you get are vague metrics, such as the 285MB/s read and 275MB/s write quoted for an OCZ Agility 2 I picked up recently (although it screams for now, so I do not care for the moment). That means you need to dig on the manufacturer’s site to find the product ratings.
If your workload runs at low queue depths, you will not see a great change in I/O.
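A rough sketch of translating a vendor throughput figure into SQL Server terms. It assumes the 285 MB/s headline number above was measured with 4 KB transfers, which is a guess; check the fine print, since test conditions (queue depth, span, transfer size) vary widely between vendors:

```python
# Rough conversion of a vendor throughput figure to IOPS at SQL Server's
# 8 KB page size. Assumes the vendor measured with 4 KB transfers.

def iops(throughput_mb_s: float, block_kb: int) -> float:
    return throughput_mb_s * 1024 / block_kb

vendor_read_mb_s = 285  # OCZ Agility 2 headline number quoted above

print(f"at 4 KB: {iops(vendor_read_mb_s, 4):,.0f} IOPS")  # ~73,000
print(f"at 8 KB: {iops(vendor_read_mb_s, 8):,.0f} IOPS")  # ~36,500 - half
# Hence the advice above: cut vendor 4 KB numbers in half for SQL Server.
```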
To protect data – a DBA’s primary role – we can rely more on SLC, since it is a lot more robust, whereas MLC needs a lot more spare space. eMLC is pricey, but it is better for a reason.
E.g., half of a Fusion-io card should be used for reads – and you need to understand what is under the covers that makes it an enterprise-level card. How many writes before it dies? Ask the manufacturer, and do your own math comparing your vendor’s specs against your needs.
The life span of the drive could be very short, depending on use. Consumer drives are not rated for full-drive or continuous writes, so if you are hammering the disk it may not even last a couple of years. NAND chips tend to roll off the same production lines but are binned into different grades, so watch the specs for the grade.
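“Do your own math” might look like the hedged sketch below. Every input is an assumption: replace them with your vendor’s rated specs and your own measured workload.

```python
# Hedged drive-endurance estimate. All inputs below are assumptions:
# swap in your vendor's rated specs and your measured workload.

capacity_gb = 80          # usable capacity
pe_cycles = 10_000        # rated P/E cycles (MLC; SLC is often rated far higher)
write_amplification = 2.0 # measured or vendor-quoted WA for your workload
daily_writes_gb = 200     # what your workload actually writes per day

total_endurance_gb = capacity_gb * pe_cycles / write_amplification
lifetime_days = total_endurance_gb / daily_writes_gb

print(f"estimated lifetime: {lifetime_days / 365:.1f} years")  # ~5.5 years
```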
SATA was not made to be enterprise level; SAS was, with much higher queue depths and better ECC.
SMART monitoring is not effective on SSDs; SCSI-based monitoring is much better.
SAS has two full-duplex ports, whereas SATA does not.
You can get fibre-attached SAS. RAID 0 can be unreliable on SSDs, so do not jump into that.
As mentioned above, understand your workload before buying an SSD.
SSDs are never as fast as on day one; they get worse over time because cells wear out, and even if ECC saves your data, the I/O suffers. SSD manufacturers are starting to catch onto this and have put mitigations in place to smooth out the eventual garbage-collection slowdown. In its second year, expect your SSD to be maybe 10% slower than in its first.
If you buy the cheapest drive, you’ll get burned over time. Invest appropriately.
Applying firmware updates might wipe a drive… be very careful with these updates.
Summary
As you can see, flash is in some respects much more complicated than your traditional hard disk. There are many things you must consider when implementing solid state storage.
Takeaways
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. Not nearly the issue it used to be, but you must understand your write patterns.
Plan for over-provisioning and TRIM support; it can have a huge impact on how much storage you actually buy.
Flash can be error-prone. Be aware that writes and reads can cause data corruption.
Presentation Questions – on read/write workloads: stick with hard disks for now in a BI environment.
SSDs can reduce power consumption, however, so for larger data centers consider the electricity savings.
Is there potential for losing data? At the enterprise level it is not an issue, but there will be a consistency check on the drive during boot-up after a sudden power-off.
Is it worth using a RAID 5 setup for these disks? Catastrophic failures of these drives are not unheard of at all, so RAID 1 or RAID 10 for critical data should be considered, despite the cost.
Do individual blocks wear out, or the entire disk? What is going on: when a block can no longer be erased, it is marked as unusable and taken offline by the SSD’s controller.
Much praise for Fusion-io, because it is ready for the enterprise level.
TempDB – a nice high mix of reads and writes – was iffy on SSD 3 years ago, but now, just mirror it at least.
All SAN/NAS vendors are offering SSD as a super-cache level/layer.
Best practices for SQL Server on SSD:
If you are on a textbook setup, with proper amounts of empty space, an SSD is not totally necessary.
If you are on an overloaded SAN, or overloaded disks, throw in an SSD where you have poor performance.
Similarly, for I/O issues, drop in an SSD for the silver-bullet effect.
RAID 1 or RAID 10 is recommended, not RAID 5, due to the write overhead.
Write-caching is OK on enterprise-level drives; the no-caching flag should be honoured.
Use Windows Server 2008 R2 to take advantage of TRIM.
Capture your workload to fully understand it. Reads perform best on an SSD.
Thanks to Idera for hosting this great informative presentation and to Wesley Brown for all this content!