We’d like to share some early and promising performance results for our ARTESCA object storage software. The numbers below come from testing of the ARTESCA storage services layer (the storage I/O engine) at an Intel lab. Performance has always been a design goal in our products, and it has become more important now that our customers and application partners are embracing S3 object storage in new and varied ways that require an evolution in performance. The ARTESCA product is our commitment to providing solutions that are responsive to the rapidly changing needs of this ecosystem.

Many of our readers are already knowledgeable about the characteristics of traditional object storage products. The general understanding is that object storage can perform very well for larger files, delivering streaming throughput for images, video and large backup payloads. These are the types of workloads where scalable throughput is the key metric, and some object storage systems can happily deliver dozens of GB/sec of throughput and beyond. In contrast, the perception for smaller file data is that object storage is relatively poor at delivering strong transactional throughput (operations per second) and very low response-time latencies.

These performance metrics are now becoming much more relevant as the workloads and data sets applicable for object storage are increasingly varied and demanding: AI and machine learning, IoT, data lakes/analytics and an unpredictable avalanche of applications on the edge. New apps will need to create and consume files that start in the range of 10s to 100s of bytes (think IoT event streams) and extend to multi-gigabyte media files. Note that our RING product has been able to perform very well under demanding workloads such as web-scale email, where payloads are often as small as 100s of bytes. So, the new era will have workloads with wide-ranging performance demands.
On top of that, new applications will require:

- Modern Amazon S3-compatible object storage: Comprehensive support of the S3 API, including advanced capabilities
- High performance across the board: Ultra low-latency access to small files, high (and scalable) ops/sec, while retaining high-throughput capabilities for large files
- Super-durable storage you can depend on and trust: The solution must be engineered down to the low-level storage services layer to protect data and keep it available in the face of a variety of threats to integrity and loss.

The last bullet item here is key. It is much, much easier to create a fast object store if you can relax any one of these constraints. Any storage system that does less work can show better benchmark numbers. This is especially true for some object storage systems that focus mainly on being a lightweight S3 front-end, but do so at the expense of disk management and data protection capabilities. Now Scality ARTESCA enters this market, designed not only to provide lightweight object storage, but to do so without tradeoffs on those last two bullets.

Testing latency and bandwidth of ARTESCA storage services

Before we dive into some of the underlying tech, let’s look at the promised results of our storage engine testing. Our goal was to stress the storage engine under demanding small file workloads and see how it performs with NVMe flash drives for data storage and an SSD used as a write cache. All tests were run on three hosts (storage servers), connected through their 25GbE network interfaces to a switch. We measured read, write and delete operation latencies for small file sizes (4 KB, 32 KB and 64 KB). Important to understanding the results is that we measured the time of each full operation (time to the last byte of the file).

Note: More details about the server configurations are described later in this post.

Now, let’s cut to the chase with the performance numbers.
In short, the storage engine achieves small-file (4 KB) operation latencies under 1 millisecond on servers with SSD as a write cache and NVMe flash for data storage (again, timed for the full operation to the last byte of the file):

- 0.28 millisecond (ms) read latency with 4 KB files
- 0.61 ms write latency with 4 KB files
- Sub-1 ms latencies for read/write operations are sustained with 32 KB files (0.34 and 0.67 ms respectively) and 64 KB files (0.36 and 0.74 ms respectively)
- These numbers are achieved with full data protection mechanisms enabled (dual-level erasure coding, as described in the next section)

As stated earlier, these are quite promising results, as they show that the heavy lifting required of the storage engine to ensure data is protected can coexist with sub-millisecond latencies on these modern flash servers. This is part of the overall system design to accelerate object storage, and in a subsequent post we’ll look at large file and full-stack performance. In the next section, we’ll describe the mechanisms in place to provide this durable data protection.

ARTESCA data durability

In designing the storage engine, we had 3 key objectives:

- Unparalleled data durability and accelerated rebuilds
- Leverage high-density flash (NVMe, QLC and future media) and maintain reliability of media (also as densities increase in the future)
- Software-defined architecture to maintain the customer’s choice of flexible platform support

ARTESCA storage services implement the following mechanisms to ensure data is stored efficiently and protected on high-density storage servers:

- Dual-level erasure coding (EC): A combination of network (distributed) erasure codes and local parity codes within a server. This provides protection against failure and accelerated local (non-network) repair times for local disk failures, which produces a boost in durability, especially on larger drives.
Schemas are configurable for flexibility and also adaptive in the event of drive failures.
- Data replication: Multiple copies of objects stored across multiple different disk drives and machines. This is optimal in terms of storage efficiency and performance for smaller data objects.
- Data integrity assurance: Checksums are stored at extent and object level, with continual background scanning (disk scrubbing) and validation of checksums upon read. This focuses on eliminating data corruption induced by disk errors (“bit rot”).
- Failure detection and automated self-healing (repair/rebuild): In the event of disk drive failures, the storage engine can relocate affected (or missing) blocks to other available unallocated storage, and can do so to multiple disks in parallel.

In a multi-server ARTESCA cluster, this dual-level EC scheme provides super high levels of disk failure tolerance and durability. To provide a specific example: Assume we have six modern storage servers, with 24 disk drives per server. For data durability, the system is configured using dual-level erasure codes as follows: 16+4 (local EC data+parity) and 4+2 (network EC data+parity). In plain terms, each server protects data with local parity, but the system also spreads data across the cluster with distributed data and parity stripes. Note that these durability policies are flexible and can be configured as needed for more durability or better space efficiency.

Here are the failures tolerated by this system without loss of data or service availability:

- Local EC on each server: 4 drives per server can fail
- PLUS with network EC: 2 full servers can fail simultaneously (48 drives)

So, the calculation is: 2 servers can fail entirely (48 drives), plus the remaining 4 servers can tolerate 4 drive failures each (16 drives), for a total of 64 drives lost out of 144. In this configuration, the system maintains availability and data protection even with 44% of the drives failing.
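The failure-tolerance arithmetic in this example can be reproduced with a short script. This is a minimal sketch and not part of ARTESCA: the class and field names are hypothetical, and it simply assumes, as the example does, that each surviving server tolerates as many drive failures as its local parity count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualLevelEC:
    """Hypothetical model of the dual-level EC example (names are assumptions)."""
    servers: int            # storage servers in the cluster
    drives_per_server: int  # drives in each server
    local_parity: int       # parity count of the local (in-server) code, e.g. 4 in 16+4
    network_parity: int     # parity count of the network code, e.g. 2 in 4+2

    def total_drives(self) -> int:
        return self.servers * self.drives_per_server

    def max_drive_failures(self) -> int:
        # Whole servers that may be lost, counted in drives.
        lost_with_servers = self.network_parity * self.drives_per_server
        # Each surviving server can additionally lose `local_parity` drives.
        surviving_servers = self.servers - self.network_parity
        return lost_with_servers + surviving_servers * self.local_parity

    def failure_fraction(self) -> float:
        return self.max_drive_failures() / self.total_drives()


example = DualLevelEC(servers=6, drives_per_server=24, local_parity=4, network_parity=2)
print(example.total_drives())               # 144
print(example.max_drive_failures())         # 64
print(f"{example.failure_fraction():.0%}")  # 44%
```

Changing `local_parity` or `network_parity` shows how a different durability policy shifts the number of tolerated drive failures.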
The result is a massive level of durability protection: 10X more failure protection than classical distributed erasure coding, while still maintaining data protection.

The tech: An inside look at ARTESCA storage services

Let’s take a look under the covers at the storage engine to see how ARTESCA achieves both durability and high performance. ARTESCA uses fixed-size extents to store object data payloads, which alone provides multiple benefits: it avoids over-allocation of file system inodes, improves scalability and also helps avoid fragmentation. Small objects fit into available space in the open extent (larger files can be split and stored across extents). As objects are stored, they are allocated space in the current open extent and written sequentially to available space. As extents are filled, they are closed, and new extents are allocated. Once an extent is closed, the engine on each server computes parity on it (the local parity codes described earlier) and stores it with an extent-level checksum.

This also helps optimize high-density flash, since the fixed-extent storage scheme fits naturally with flash media IO patterns. The engine safely stores extents initially on fast flash media (an SSD was used as a write buffer in our test above), and once they are filled, the system can stage these locked (filled and erasure coded) extents down to permanent long-term media (NVMe or QLC flash). A further effect is that IO on the long-term media is minimized, which preserves program/erase (P/E) cycles on high-density flash devices.

The test configuration

Here is some more information on the setup we used to conduct the lab test described earlier.
We used three modern Intel-based 24-drive storage servers, each configured as follows:

- Intel Xeon-Gold 6226R (2.9 GHz/16-core/150 W) processor
- 128GB (1 x 128GB) quad rank x4 DDR4-2933
- 1 x HPE Write Intensive – SSD – 750 GB – PCIe x4 (NVMe)
- 1 NVMe, 750GB (OS/system disk)
- 6 NVMe, 3.84TB (NVMe Gen4 mainstream performance SFF SSD)
- Ethernet 10/25Gb 2-port adapter

Note that the results above were achieved on servers that were partially populated with disk drives, and it is certainly the case that more disks would further improve performance for IO-intensive workloads.
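For readers who want a concrete picture of the fixed-size extent scheme described in the tech section, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not ARTESCA's implementation: the `Extent` and `ExtentStore` names are hypothetical, a CRC32 from Python's `zlib` stands in for the extent-level checksum, and parity computation and staging to long-term media are left out.

```python
from __future__ import annotations

import zlib
from dataclasses import dataclass, field


@dataclass
class Extent:
    """A fixed-size, append-only region (hypothetical illustration)."""
    capacity: int
    data: bytearray = field(default_factory=bytearray)
    closed: bool = False
    checksum: int | None = None

    def free(self) -> int:
        return self.capacity - len(self.data)

    def append(self, chunk: bytes) -> int:
        """Write sequentially into the free space; return the start offset."""
        offset = len(self.data)
        self.data += chunk
        return offset

    def close(self) -> None:
        # CRC32 is a stand-in for the extent-level checksum (an assumption).
        self.checksum = zlib.crc32(bytes(self.data))
        self.closed = True


class ExtentStore:
    def __init__(self, extent_size: int) -> None:
        if extent_size <= 0:
            raise ValueError("extent_size must be positive")
        self.extent_size = extent_size
        self.extents = [Extent(extent_size)]

    def put(self, payload: bytes) -> list[tuple[int, int, int]]:
        """Store a payload; return (extent index, offset, length) per piece."""
        locations = []
        view = memoryview(payload)
        while len(view) > 0:
            current = self.extents[-1]
            # Take as much as fits; larger objects spill into later extents.
            piece = bytes(view[: current.free()])
            offset = current.append(piece)
            locations.append((len(self.extents) - 1, offset, len(piece)))
            view = view[len(piece):]
            if current.free() == 0:
                # A filled extent is closed and a new open extent is allocated.
                current.close()
                self.extents.append(Extent(self.extent_size))
        return locations


store = ExtentStore(extent_size=8)
print(store.put(b"abc"))       # [(0, 0, 3)]
print(store.put(b"defghijk"))  # [(0, 3, 5), (1, 0, 3)]
```

In this toy run the second object fills the first extent, which is then closed and checksummed, and its remainder lands at the start of a fresh extent.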