How to Contact Master Brown
The presentation starts with a bit of advertising to sit through, but I like Idera's products, so they have the right to some self-promotion here. I use their Admin Toolset practically every day. Moving on…
Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. Like with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Let’s review what solid state storage is and where it differs from traditional hard disks.
Presentation is a general and impartial coverage of SSD technology.
NAND is a suitable replacement for block-based storage hardware.
The Structure of NAND Flash
Before we dig into more details: NAND comes in Multi-Level Cell (MLC) and Single-Level Cell (SLC) varieties.
A piece of NAND flash can only be written to about 10,000 times, and its endurance gets worse over time.
SLC is harder to manufacture, so it is more expensive. I assume that over time, SLC will drop in price as demand grows.
Metrics: random seek time of 50 nanoseconds.
Remember, SSDs are not hard disks, but chips, and there are no moving parts.
A Fusion-io card aggregates many NAND chips together; that is how we get amazing performance.
Random-access write I/O is still comparatively poor, though, so RAM will not be replaced.
NAND flash does not support an overwrite operation.
We worry about how NAND flash fails: it fails to erase, but usually remains readable. A flash disk, for example, can often still be read but can no longer be written. I have bought cheap flash thumb drives and seen this write failure after only a short period of use. Beware cheap space that is only writable a few times; you get what you pay for!
NAND is written in pages, but erased in blocks – here's why. It's called the program/erase (P/E) cycle, and NAND does all writes based on it. When a NAND block is considered erased, all bits are set to 1. As you program a bit, you set it to 0. The program cycle writes a page at a time and can be pretty quick. NAND doesn't support an overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk, which can overwrite data without first having to clear a sector. Erasing a block takes between 500 nanoseconds and 2 milliseconds. Each P/E cycle wears on the NAND block. After enough cycles, the block becomes unreliable and will fail to program or erase (thus the issue mentioned above with cheap flash memory still being readable, but not writable).
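The erased-to-1, program-to-0 behaviour above can be sketched in a few lines. This is a toy model with invented page and block sizes, not any real controller's logic:

```python
# Toy model of NAND program/erase semantics. PAGE_BITS and the page
# count are invented for illustration; real pages are 4-16 KB.
PAGE_BITS = 8

class NandBlock:
    """Pages are programmed individually; erasure hits the whole block."""
    def __init__(self, pages=4):
        self.pages = [[1] * PAGE_BITS for _ in range(pages)]  # erased = all 1s
        self.pe_cycles = 0

    def program_page(self, page_no, data):
        page = self.pages[page_no]
        if any(b == 1 and cur == 0 for b, cur in zip(data, page)):
            # Programming can only flip bits 1 -> 0. Turning a 0 back
            # into a 1 requires erasing the entire block first.
            raise ValueError("overwrite not supported: erase the block first")
        self.pages[page_no] = [cur & b for cur, b in zip(page, data)]

    def erase(self):
        self.pages = [[1] * PAGE_BITS for _ in self.pages]
        self.pe_cycles += 1  # every erase wears the block a little

block = NandBlock()
block.program_page(0, [1, 0, 1, 0, 1, 1, 1, 1])      # fine: only 1 -> 0 flips
try:
    block.program_page(0, [1, 1, 1, 1, 1, 1, 1, 1])  # would need a 0 -> 1 flip
except ValueError as err:
    print(err)           # overwrite not supported: erase the block first
block.erase()            # whole-block reset; pe_cycles is now 1
```

The key point the sketch makes: the second write fails not because the page is full, but because restoring any bit to 1 forces a whole-block erase, and every erase counts against the block's P/E rating.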
To mitigate the finite number of P/E cycles a NAND chip has, we use two different techniques: keep blocks alive longer, and make sure we don't use a known-bad block again. Take a single NAND MLC chip. It may have 16,000 blocks on it, each rated for between 3,000 and 10,000 P/E cycles. If you executed one P/E cycle per second, spread evenly across every block, it would take you over five years to reach the wear-out rating of 10,000 cycles. If, on the other hand, you executed one P/E cycle per second against a single block, you could hit the 10,000 rating in about three hours! This is why wear-levelling is so important. In the early days of NAND flash, wearing out a block was a legitimate concern, as applications would just rewrite the same block over and over. Modern devices spread writes not just across a single chip but across every available chip in the system, extending the life of your solid state disk for a very, very long time. Ideally, you want to write to every block once before writing to any block a second time. That isn't always possible due to data access patterns.
In other words: at one cycle per second spread across the whole drive, the flash would take about five years to wear out, so it is not something to obsess about – but concentrate those cycles on one spot and you could wear out a piece of NAND very quickly.
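The arithmetic behind those two figures, using the example numbers above (16,000 blocks rated at 10,000 cycles; not any specific drive):

```python
# Wear-out arithmetic using the example figures above (16,000 blocks,
# 10,000 rated P/E cycles per block) -- not any specific drive.
blocks = 16_000
pe_rating = 10_000

# Worst case: one P/E cycle per second, always hitting the same block.
hours_single_block = pe_rating / 3600
print(round(hours_single_block, 1))        # ~2.8 hours

# Best case: perfect wear-levelling spreads cycles over every block.
years_all_blocks = blocks * pe_rating / (3600 * 24 * 365)
print(round(years_all_blocks, 2))          # ~5.07 years
```

Same write rate, a roughly 16,000x difference in lifetime – which is the whole argument for wear-levelling.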
It all becomes very complicated: a single logical write can generate additional physical I/O for the reasons above.
To defer the P/E cycle and mitigate the penalty of a block erase, we rely on garbage collection running in the background of the device. When a file is altered, it may be moved entirely to clean pages and blocks, and the old blocks are marked as dirty. This tells the garbage collector that it can perform a block erasure on them at any time. This works just fine as long as the drive has enough spare area allocated and the number of write requests is low enough for the garbage collector to keep up. Keep in mind, this spare area isn't visible to the operating system or the file system and is independent of them. If you run out of free pages to program, you start forcing a P/E cycle for each write, slowing down writes dramatically. Some manufacturers offset this with a large DRAM buffer, and some also allow you to change the size of the over-provisioned space.
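A minimal sketch of out-of-place writes plus background garbage collection as described above. The block counts and the GC policy are invented for illustration; real controllers are far more sophisticated:

```python
# Out-of-place writes + background GC, heavily simplified.
FREE, LIVE, DIRTY = "free", "live", "dirty"

class Device:
    def __init__(self, blocks=4, pages_per_block=4):
        self.blocks = [[FREE] * pages_per_block for _ in range(blocks)]
        self.erases = 0

    def _find_free(self):
        for b, blk in enumerate(self.blocks):
            for p, state in enumerate(blk):
                if state == FREE:
                    return b, p
        return None

    def collect_garbage(self):
        # A block containing no live pages can be erased wholesale.
        for b, blk in enumerate(self.blocks):
            if LIVE not in blk:
                self.blocks[b] = [FREE] * len(blk)
                self.erases += 1

    def write(self, old=None):
        """Write one page out-of-place; updating marks the old copy dirty."""
        if old is not None:
            ob, op = old
            self.blocks[ob][op] = DIRTY   # old version now awaits erase
        loc = self._find_free()
        if loc is None:
            self.collect_garbage()        # no free pages: the write stalls on GC
            loc = self._find_free()
            if loc is None:
                raise RuntimeError("device full of live data")
        b, p = loc
        self.blocks[b][p] = LIVE
        return b, p

dev = Device()
loc = dev.write()                 # first write of a logical page
for _ in range(20):
    loc = dev.write(old=loc)      # keep updating the same logical page
print(dev.erases)                 # 4: GC erased the dirty-only blocks in bulk
```

Note the failure mode the paragraph warns about: while free pages remain, writes are cheap; the moment they run out, a write has to wait for block erasure, which is exactly the "forced P/E cycle per write" slowdown.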
Write amplification: as the device moves blocks around, writes slow down due to the extra I/O overhead.
Write Amplification - Another pitfall of wear-levelling and garbage collection is the phenomenon of write amplification. As the device tries to keep up with write requests and garbage collection, it can effectively bring everything to a standstill. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP-style databases mean lots of little inserts, updates and deletes, which adds to the problem. There may be enough free space to accommodate the write, but if it is severely fragmented by the write pattern, a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available. This also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor-performing one when it comes to writes, so do not let your SSDs fill up!
Garbage collection cycle.
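A quick way to put a number on write amplification is the ratio of physical bytes the device actually writes to the logical bytes the host asked for. The figures here are invented purely for illustration:

```python
# Write amplification = physical bytes written / logical bytes requested.
# All figures below are invented for illustration.
host_write_kb = 8        # e.g. one 8 KB SQL Server page
relocated_kb = 24        # live pages GC had to move to free up a block
physical_kb = host_write_kb + relocated_kb
write_amplification = physical_kb / host_write_kb
print(write_amplification)   # 4.0 -- the device wrote 4x what was asked
```

Every multiple above 1.0 is extra wear and extra latency, which is why fragmented small-write patterns (typical OLTP) hurt more than large serial writes.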
Another technology that has started to gain momentum is the TRIM command. Fundamentally, TRIM allows the operating system to tell the storage device how much free space the file system has, so the device can use that space like the reserved, over-provisioned area used for garbage collection. The downside is that it is really only available in Windows 7 and Windows Server 2008 R2, though some manufacturers include a separate TRIM service for the OSes that don't support it natively. Also, TRIM can only be effective if there is enough free space on the file system; if you fill the drive to capacity, TRIM is completely useless. Another thing to consider is that an erasable block may be 256 KB, while we generally format our file systems for SQL Server at 64 KB – several times smaller than the erasable block. One last thing to remember, and it is good advice for any device, not just solid state storage: grow your files in large chunks to keep file fragmentation to a minimum. Heavy file fragmentation also cuts down on TRIM's effectiveness and can't easily be fixed, since running a defragment may actually make the problem worse – it forces wholesale garbage collection and wears out the flash that much faster.
TRIM is fine on Windows 7 and 2008 R2. If you have filled up the drive, you have defeated TRIM, so do not fill up an SSD. Do not defrag an SSD either; defragmentation wears it out (oops, quietly kicking myself now) – let TRIM do its job.
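The 256 KB vs. 64 KB mismatch above in numbers (both sizes are taken from the paragraph above):

```python
# The erase-block vs. allocation-unit mismatch described above.
erase_block_kb = 256
cluster_kb = 64                      # common SQL Server format size
clusters_per_block = erase_block_kb // cluster_kb
print(clusters_per_block)            # 4
# Freeing one 64 KB cluster cannot release the 256 KB block: the other
# three clusters sharing it must also be free before GC can erase it.
```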
Error Detection and Correction
Data in cells that aren't being written to can be corrupted by writes to adjacent cells or even pages; this is called Program Disturb. Cells not being programmed receive an elevated voltage, causing them to appear weakly programmed. There isn't any damage to the physical structure, and the condition can be cleared with a normal erase.
Reading repeatedly from the same block can have a similar effect, called Read Disturb. Cells not being read collect a charge that causes them to appear weakly programmed. The main difference from Program Disturb is that it always occurs on the block being read, and always on the pages not being read. Again, the physical cells aren't damaged, and an erase of the affected block clears the issue.
Lastly, there is an issue with data retention in cells over time. A floating gate may gain or lose charge over time, making the cell appear weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to. All of this sounds about as catastrophic as it gets. Fortunately, error correcting code (ECC) techniques deal with these issues effectively (especially on enterprise-level SSDs, where the ECC is rock solid).
Bad Block Management
Not all drives are created equal. Make sure you read the specs of the drive you are buying! Benchmarks are often quoted at 4 KB blocks; with SQL Server, you should roughly halve those numbers because we use 8 KB pages. As stated in the slide detail, be very careful about short-stroking, queue depths and odd block-transfer sizes, because these SSDs are still a maturing technology. Often you will not find these specs on vendor sites; all you get are vague metrics, such as 285 MB/s read and 275 MB/s write on an OCZ Agility 2 series drive I picked up recently (although it screams for now, so I do not care at the moment). It means you need to dig on the manufacturer's site to find the real product ratings.
If your workload runs at low queue depths, you will not see a great change in I/O.
To protect data – a DBA's primary role – we can rely more on SLC, since it is a lot more robust, whereas MLC needs a lot more spare area. eMLC is pricey, but it is better for a reason.
E.g. half of a Fusion-io card should be used for reads – and you need to understand what is under the covers that makes it an enterprise-level card. How many writes before it dies? Ask the manufacturer, then do your own math, weighing your vendor's specs against your needs.
The life span of the drive could be very short, depending on use. Consumer drives are not rated for full, continuous writes; if you are heavily using the disk, it may not even last a couple of years. NAND chips tend to roll off the same production lines but come in different grades, so watch the specs for the grade.
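Doing "your own math" on endurance might look like the following sketch. Every number here is a made-up example; substitute your drive's rated P/E cycles and your measured workload:

```python
# Hypothetical endurance estimate -- every number is a made-up example.
capacity_gb = 200
pe_rating = 10_000          # rated P/E cycles (from the vendor spec)
write_amplification = 2.0   # assumed average for the workload
daily_writes_gb = 500       # measured host writes per day

total_endurance_gb = capacity_gb * pe_rating / write_amplification
lifetime_days = total_endurance_gb / daily_writes_gb
print(round(lifetime_days / 365, 1))   # ~5.5 years at this write rate
```

Note how sensitive the result is to write amplification: double the WA and the estimated lifetime halves, which is why understanding your write pattern matters before you buy.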
SATA is not made to be enterprise level; SAS is, however, with much higher queue depths and ECC.
SMART monitoring is not effective on SSDs; SCSI-based monitoring is much better.
SAS has two full-duplex ports, whereas SATA does not.
You can get fibre-attached SAS. RAID 0 can be unreliable on SSDs, so do not jump into that.
As mentioned above, understand your workload before buying an SSD.
SSDs are never as fast as on day one; they get slower over time as cells wear out – even if ECC saves your data, the I/O suffers. SSD manufacturers are starting to catch on to this and have put mitigations in place to limit the eventual garbage-collection slowdown. In the second year, expect your SSD to be maybe 10% slower than in your first year.
If you buy the cheapest drive, you'll get burned over time. Invest appropriately.
Applying a firmware update might wipe a drive… be very careful with these updates.
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. Not nearly the issue it used to be, but you must understand your write patterns.
Plan for over-provisioning and TRIM support; it can have a huge impact on how much storage you actually buy. Flash can be error-prone: be aware that writes and reads can cause data corruption.
Presentation questions – read/write questions. Stick with hard disks for now in a BI environment.
SSD can reduce power consumption however, so for larger data centers, consider reducing electricity costs.
Is there potential for losing data? At the enterprise level, not an issue, but there will be a consistency check on the drive when it boots back up after a sudden power-off.
Is it worth using a RAID 5 setup for these disks?
Catastrophic failures of these drives are not unheard of at all, so RAID 1 or RAID 10 for critical data should be considered, despite the cost.
Do individual blocks wear out, or the entire disk? What is going on? When a block can no longer be erased, it is marked as unusable and taken offline by the SSD's controller.
Much praise for Fusion I/O because it is ready for Enterprise level.
TempDB – a nice high mix of reads and writes. Three years ago putting it on SSD was iffy, but now just mirror it at least.
All SAN/NAS vendors are offering SSD as a super-cache level/layer.
Best practices for SQL Server on SSD:
If you are on a textbook setup, with proper amounts of empty space, an SSD is not totally necessary.
If you are on an overloaded SAN, or overloaded disks, throw an SSD in where you have poor performance.
Similarly, for I/O issues, drop in an SSD for the silver-bullet effect.
RAID 1 or RAID 10, not RAID 5, is recommended due to write overhead.
Write caching is okay on enterprise-level drives; the no-caching flag should be honoured.
Use Windows Server 2008 R2 to take advantage of TRIM.
Capture your workload to fully understand it. Reads are best on an SSD.
Thanks to Idera for hosting this great informative presentation and to Wesley Brown for all this content!