Accelerators/GPUs are vital for reducing machine learning (ML) model training times due to their high-bandwidth memory and parallel architecture. GPUs can perform many operations simultaneously and process large amounts of data, which nowadays can be terabytes1, or even petabytes1, in size. This leads us to the question: what are the actual storage requirements for these very demanding AI/ML workloads?
Data-intensive workloads require underlying storage that can support large capacity, high throughput and low latency. High performance is required so that compute resources, such as GPUs, can train on different pieces of data in parallel, keeping accelerator utilization high. The storage system must respond quickly to multiple queries; otherwise, the preprocessing, training and inferencing processes could slow down. It must be able to service fast data accesses not only from one GPU, but from many GPUs simultaneously, so that different models can be trained at the same time.
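To make that concurrency requirement concrete, here is a minimal Python sketch of many readers pulling sample files from the same device at once, standing in for multiple GPUs fetching training data in parallel. The file paths, reader count and chunk size are illustrative assumptions, not part of our actual test setup:

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_READERS = 16  # stand-ins for 16 accelerators pulling data at once
# Hypothetical sample file paths; substitute real training data locations.
SAMPLE_FILES = [f"/data/train/sample_{i:04d}.npz" for i in range(NUM_READERS)]
CHUNK = 1 << 20   # read in 1 MiB chunks

def read_sample(path: str) -> int:
    """Sequentially read one sample file and return the bytes consumed."""
    total = 0
    # buffering=0 avoids Python-level buffering; note the OS page cache can
    # still inflate results, so a rigorous test drops caches or uses O_DIRECT.
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=NUM_READERS) as pool:
    bytes_read = sum(pool.map(read_sample, SAMPLE_FILES))
elapsed = time.perf_counter() - start

print(f"{bytes_read / elapsed / 1e9:.2f} GB/s aggregate "
      f"across {NUM_READERS} concurrent readers")
```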
NVMe™ SSDs meet these requirements very well and are widely available across PCIe® interface generations and performance profiles. PCIe 5.0 NVMe SSDs deliver significant throughput and latency advantages over PCIe 4.0 SSDs, but do these gains translate to the application level during the various ML training processes? And how do these SSDs fare when they must service multiple accelerators simultaneously?
To answer these questions, we compared our PCIe 5.0 SSD, the KIOXIA CM7 Series, with a PCIe 4.0 SSD from a competitor. We used the DLIO benchmarking tool2, which emulates the I/O access patterns that deep learning workloads generate against the storage system. We ran its unet3d workload (which emulates training of a convolutional neural network developed for medical imaging) to see whether the increased performance delivered any application-level advantages.
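For reference, a run of this kind can be scripted roughly as follows. This is a hedged sketch based on the command line and Hydra-style overrides documented in the public dlio_benchmark repository; exact option names can vary between DLIO versions, and the accelerator count shown is just an example:

```python
import subprocess

# Step 1: generate the synthetic unet3d dataset (one-time setup).
subprocess.run(
    ["dlio_benchmark", "workload=unet3d",
     "++workload.workflow.generate_data=True",
     "++workload.workflow.train=False"],
    check=True,
)

# Step 2: emulate training with 16 simulated accelerators (one MPI rank each)
# reading that dataset back from the SSD under test.
subprocess.run(
    ["mpirun", "-np", "16", "dlio_benchmark", "workload=unet3d"],
    check=True,
)
```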
Of particular interest was the burst-like nature of this workload's throughput at the start of each epoch: the workload tries to access all of the data it needs for the current epoch and fit it into accelerator memory. We measured the maximum throughput achieved from the underlying SSD while the workload was running. As the test results below show, when 16, 17 or 18 accelerators were accessing data from the KIOXIA CM7 SSD, it delivered 91% more I/O throughput than the competitor's PCIe 4.0 SSD!
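One simple way to capture that peak on Linux is to sample the block device's cumulative read counters while the epoch runs. The sketch below polls /proc/diskstats once per second and keeps the largest per-interval throughput; the device name and sampling window are assumptions:

```python
import time

DEVICE = "nvme0n1"  # hypothetical device name; adjust for your system

def sectors_read(dev: str) -> int:
    """Return cumulative sectors read for a block device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                return int(parts[5])  # field 6: sectors read (512 bytes each)
    raise ValueError(f"device {dev!r} not found")

# Sample once per second while the epoch runs and keep the peak interval.
peak_gbps, prev = 0.0, sectors_read(DEVICE)
for _ in range(60):                   # observe a 60 s window
    time.sleep(1)
    cur = sectors_read(DEVICE)
    gbps = (cur - prev) * 512 / 1e9   # bytes/s over the 1 s interval
    peak_gbps = max(peak_gbps, gbps)
    prev = cur

print(f"peak read throughput: {peak_gbps:.2f} GB/s")
```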
We also monitored the average SSD read latency while the workload was running. Across all accelerator counts, the KIOXIA CM7 SSD delivered lower latency than the PCIe 4.0 SSD, getting each sample to the accelerators faster so that training on it could begin sooner. The difference really emerges when more accelerators hit the drive at the same time: the KIOXIA CM7 SSD delivered 57% lower read latency (or 57% faster response time) than the PCIe 4.0 SSD, as shown below:
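Average read latency can be derived from the same Linux counters, using the ratio that iostat reports as r_await: time accrued servicing reads divided by reads completed. A minimal sketch, again assuming the device name:

```python
import time

DEVICE = "nvme0n1"  # hypothetical device name; adjust for your system

def read_counters(dev: str) -> tuple[int, int]:
    """Return (reads completed, ms spent reading) from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                return int(parts[3]), int(parts[6])  # fields 4 and 7
    raise ValueError(f"device {dev!r} not found")

reads0, ms0 = read_counters(DEVICE)
time.sleep(5)                        # sample window while the workload runs
reads1, ms1 = read_counters(DEVICE)

completed = reads1 - reads0
if completed:
    # Same math iostat uses for r_await: read time accrued / reads completed.
    print(f"average read latency: {(ms1 - ms0) / completed:.3f} ms "
          f"over {completed} reads")
```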
We were happy to see training times drop between the PCIe generations as the number of concurrent accelerators increased. The KIOXIA CM7 SSD's performance advantage started to matter at higher accelerator counts (16, 17 and 18), decreasing training times under higher concurrent access. Extrapolating those savings over a year at these higher accelerator counts, businesses could potentially save 32 to 44 days of ML model training per year!
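The extrapolation itself is simple arithmetic: the fraction of each training run saved, applied to a year of continuous training. The sketch below uses hypothetical per-epoch times purely to show the calculation; they are not our measured values:

```python
# Hedged arithmetic sketch with illustrative placeholder numbers.
pcie4_epoch_s = 110.0   # assumed seconds per epoch on the PCIe 4.0 SSD
pcie5_epoch_s = 98.0    # assumed seconds per epoch on the KIOXIA CM7

saved_per_epoch_s = pcie4_epoch_s - pcie5_epoch_s
epochs_per_year = 365 * 24 * 3600 / pcie4_epoch_s  # continuous training assumed
days_saved = saved_per_epoch_s * epochs_per_year / 86400

print(f"~{days_saved:.0f} days of training time saved per year")
```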
From our testing, we concluded that a storage system able to supply higher I/O throughput alongside low latency increases sample processing rates and effectively decreases overall ML model training times. Upgrading storage solutions running older PCIe generations to KIOXIA CM7 Series PCIe 5.0 SSDs can certainly improve your business's AI/ML pipeline. See the additional results compiled from our lab testing in the full performance brief available here.
NOTES:
1 Definition of capacity: KIOXIA Corporation defines a megabyte (MB) as 1,000,000 bytes, a gigabyte (GB) as 1,000,000,000 bytes, a terabyte (TB) as 1,000,000,000,000 bytes and a petabyte (PB) as 1,000,000,000,000,000 bytes. A computer operating system, however, reports storage capacity using powers of 2 for the definition of 1Gbit = 2^30 bits = 1,073,741,824 bits, 1GB = 2^30 bytes = 1,073,741,824 bytes, 1TB = 2^40 bytes = 1,099,511,627,776 bytes and 1PB = 2^50 bytes = 1,125,899,906,842,624 bytes, and therefore shows less storage capacity. Available storage capacity (including examples of various media files) will vary based on file size, formatting, settings, software and operating system, and/or pre-installed software applications, or media content. Actual formatted capacity may vary.
2 DLIO is a data-centric benchmark for scientific deep learning applications, as published in: H. Devarajan, H. Zheng, A. Kougkas, X.-H. Sun and V. Vishwanath, "DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications," 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Melbourne, Australia, May 10-13, 2021, pp. 81-91, doi: 10.1109/CCGrid51090.2021.00018.
TRADEMARKS:
NVMe is a registered or unregistered trademark of NVM Express, Inc. in the United States and other countries. PCIe is a registered trademark of PCI-SIG. All other company names, product names and service names may be trademarks or registered trademarks of third-party companies.
DISCLAIMERS:
KIOXIA America, Inc. may make changes to specifications and product descriptions at any time. The information presented in this blog is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Any performance tests and ratings are measured using systems that reflect the approximate performance of KIOXIA America, Inc. products as measured by those tests. In no event will KIOXIA America, Inc. be liable to any person for any direct, indirect, special or other consequential damages arising from the use of any information contained herein, even if KIOXIA America, Inc. is advised of the possibility of such damages.
All other company names, product names and service names may be trademarks of their respective companies.