In my first blog entry on IO Determinism, I looked at the data center latency problem and its real impact on data center operations and the bottom line. Next, I want to explore solutions that reduce or minimize that latency.
The obvious solution, always the first to be considered and available as a feature since about 2009, is program and erase suspend. The challenge with these suspend operations is that data center workloads commonly present a nearly "infinite" stream of read requests, and asking an erase to "suspend" forever isn't going to help.
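To make that concrete, here is a toy sketch of the starvation effect, with made-up timing constants (not real NAND parameters or any vendor's firmware logic): as long as reads keep arriving faster than they can be serviced, a suspended erase never gets a window in which to finish.

```python
# Illustrative only: a toy event model of erase suspend under a steady read
# stream. All timing constants are invented for illustration.

ERASE_TIME_US = 3000       # uninterrupted time the erase needs (hypothetical)
READ_TIME_US = 80          # time to service one read (hypothetical)
READ_ARRIVAL_GAP_US = 50   # a new read arrives every 50 us: reads never stop

def erase_completion_time(sim_limit_us=1_000_000):
    """Return when the erase finishes, or None if reads starve it forever."""
    now, erase_left, next_read = 0, ERASE_TIME_US, 0
    while now < sim_limit_us:
        if next_read <= now:
            # A read is pending: suspend the erase and service the read first.
            now += READ_TIME_US
            next_read += READ_ARRIVAL_GAP_US
        else:
            # No read pending: resume the erase until the next read arrives.
            progress = min(next_read - now, erase_left)
            erase_left -= progress
            now += progress
            if erase_left == 0:
                return now
    return None

print(erase_completion_time())  # -> None: the erase is suspended "forever"
```

Raise READ_ARRIVAL_GAP_US above READ_TIME_US and the erase eventually completes; at sustained data center read rates, it never does.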
A second option, considered many times, is overprovisioning. With higher overprovisioning we reduce what is typically called write amplification (WA), the ratio of how much is written to the NAND versus how much is written by the host. It's not uncommon for data center SSDs to have 7% overprovisioning, which leads to a WA on the order of 5-7, whereas with the additional overprovisioning implemented in enterprise-class solutions (28%), we see a WA of about 2. So in theory, with a WA reduction of roughly 3x, we would expect roughly 3x fewer program and erase operations associated with garbage collection. Does this help? Sadly, the figure below gives us an answer of "no":
Source: Internal Toshiba Testing, March 2017
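The arithmetic behind the 3x expectation is straightforward; here is a quick sketch using the WA values quoted above (the workload size is a made-up placeholder):

```python
# Back-of-the-envelope check of the write amplification (WA) claim above.
# WA = data written to NAND / data written by the host, so for a fixed host
# workload, program (and matching erase) traffic scales directly with WA.

host_writes_tb = 100    # hypothetical host workload size
wa_at_7pct_op = 6       # WA of ~5-7 observed at 7% overprovisioning
wa_at_28pct_op = 2      # WA of ~2 observed at 28% overprovisioning

nand_writes_7pct = host_writes_tb * wa_at_7pct_op    # 600 TB programmed
nand_writes_28pct = host_writes_tb * wa_at_28pct_op  # 200 TB programmed

print(f"~{nand_writes_7pct / nand_writes_28pct:.0f}x fewer P/E operations")
```

The catch is that fewer program and erase operations only reduce how often a read collides with one; they do nothing about the length of the stall when a collision does happen, and it is those remaining collisions that keep the tail latency high.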
Another option is something called "Open Channel" (OC). In its most simplistic form, Open Channel gives the host complete and full control of every NAND die and NAND plane, enabling the host to make all the key decisions about how to manage the flash array. The host can then coordinate its own traffic and ensure that program and erase operations are never scheduled in a way that collides with read operations. This solution requires heavy lifting by the host: the host now owns the logical-to-physical mapping (often called the FTL, or flash translation layer) and other responsibilities such as defect mapping and RAID-level redundancy. Although a very viable solution, it is also a difficult exercise for hyperscalers, given how swiftly flash memory technology moves and the fact that the host now fully owns the reliability of the device.
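To give a feel for what that heavy lifting looks like, here is a deliberately minimal sketch of the state a host-side FTL has to own under Open Channel. The class and its placement policy are invented for illustration; a real host FTL adds wear leveling, garbage collection, power-loss recovery, and RAID-style redundancy on top of this.

```python
# Hypothetical, minimal host-side FTL for an Open Channel drive. The host,
# not the SSD, now owns the logical-to-physical map and the defect map.

PAGES_PER_BLOCK = 256  # illustrative NAND geometry

class HostFTL:
    def __init__(self, num_dies: int):
        self.l2p = {}                          # logical page -> (die, block, page)
        self.bad_blocks = set()                # defect map: the host's job now
        self.write_points = {d: (0, 0) for d in range(num_dies)}
        self.num_dies = num_dies
        self._next_die = 0

    def write(self, logical_page: int) -> tuple:
        """Place a write, round-robin across dies, skipping defective blocks."""
        die = self._next_die
        self._next_die = (self._next_die + 1) % self.num_dies
        block, page = self.write_points[die]
        while (die, block) in self.bad_blocks:  # route around mapped-out defects
            block, page = block + 1, 0
        self.l2p[logical_page] = (die, block, page)
        page += 1
        if page == PAGES_PER_BLOCK:             # block full: advance write point
            block, page = block + 1, 0
        self.write_points[die] = (block, page)
        # The host must also schedule the actual program so it never collides
        # with reads it cares about -- the whole point of owning the mapping.
        return self.l2p[logical_page]

    def read(self, logical_page: int) -> tuple:
        return self.l2p[logical_page]           # host resolves every read itself

ftl = HostFTL(num_dies=4)
print(ftl.write(0), ftl.read(0))  # -> (0, 0, 0) (0, 0, 0)
```

Every one of these structures has to track the quirks of each new NAND generation, which is exactly the maintenance burden that makes Open Channel hard for hyperscalers to sustain.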
The NVM Express® (NVMe™) committee has come up with an alternate solution called IO Determinism (IOD). The IOD architecture takes a similar approach, ensuring that read operations are never interfered with by NAND program or erase operations. Unlike Open Channel, however, IOD does not delegate the fundamental reliability of the SSD to the host stack, and it does not require significant host software changes with each NAND generation. The basic concept is to create multiple "mini-SSDs," called "NVM Sets," under the same NVMe controller. The host can choose whether each NVM Set is in "deterministic" mode (no program/erase interference) or in non-deterministic mode (garbage collection, endurance scans, etc.).
The hyperscale system would manage each of these NVM Sets, ensuring that at least one of them is in deterministic mode at any time, as shown below:
The idea is that we pair two or more NVM Sets (in this example, three) so that at any vertical slice there is at least one NVM Set in deterministic mode, able to service read operations without program or erase interference.
With such a scheme, we can guarantee that any read operation can be serviced without interference:
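Here is a toy sketch of that host-side policy (all names are invented; real deployments add window budgets, quotas, and failure handling): data is mirrored across three NVM Sets, the sets take turns doing background work, and every read is routed to a replica currently in deterministic mode.

```python
# Toy host-side scheduler for three mirrored NVM Sets. Invented structure,
# for illustration only. At any instant exactly one set is doing background
# work (GC, host writes, endurance scans), so at least one replica of every
# datum can always serve a read with no program/erase interference.

from itertools import cycle

class NVMSet:
    def __init__(self, set_id: int):
        self.set_id = set_id
        self.deterministic = True  # True: reads only; False: background work

class MirroredSets:
    """Mirror group where exactly one set at a time does background work."""
    def __init__(self, sets):
        self.sets = sets
        self._rotation = cycle(sets)
        self.rotate()              # start one set in non-deterministic mode

    def rotate(self):
        """Advance the schedule: the next set takes its turn at GC and writes."""
        busy = next(self._rotation)
        for s in self.sets:
            s.deterministic = s is not busy

    def read(self, lba: int):
        """Route a read to any replica currently in deterministic mode."""
        target = next(s for s in self.sets if s.deterministic)
        return (target.set_id, lba)  # stand-in for issuing the actual I/O

group = MirroredSets([NVMSet(i) for i in range(3)])
print(group.read(42))  # served by a set with no P/E interference
group.rotate()         # time advances; a different set does background work
print(group.read(42))
```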
The result? We improved latency by over an order of magnitude in a single generation. Feels good to break a paradigm that has been with us for over 30 years!
In my next blog, I will go through more of the implementation details of an IOD SSD and the results from a proof of concept demonstrated at Flash Memory Summit this summer.
Disclaimer
The views and opinions expressed in this blog are those of the author(s) and do not necessarily reflect those of KIOXIA America, Inc.