By Paul Speciale, chief product officer, Scality
The term “data lake” — a centralized repository that holds vast amounts of raw data in its native format — has only been around for about a decade. Despite the term’s relative newness, the data lake market is expected to reach an annual volume of $20.1 billion by 2025, according to Research and Markets.
Usually, a data lake houses data from many sources in multiple formats — all of which requires analysis in order to yield business insights. Increasingly, we hear “data lake” and “big data” mentioned in the same breath. And that makes sense, because big data analytics requires a massive trove of data to derive insights from.
The need for flexible, scalable management of all data formats
Because data lakes aggregate data from various sources, they can quickly reach petabyte scale and beyond. This data volume exceeds the capacity of traditional database technologies, such as relational database management systems (RDBMS), which were primarily designed to handle structured data.
Not only is there a potential capacity issue, but data lakes amass structured, semi-structured, and unstructured data. To manage these different data types flexibly and at scale, newer storage systems such as the Hadoop Distributed File System (HDFS) have been adopted as data lake storage solutions. But, like any technology, HDFS has its limitations.
A major downside to HDFS is that its compute and storage resources are tightly coupled as it scales, because the file system is hosted on the same machines as the application. Storage capacity and compute grow in lockstep, so adding capacity means paying for processing power you may not need — which can end up being quite expensive.
Modern object storage offers fundamental advantages for data lakes
To fully reap the business insights that lie in these massive data lakes, organizations depend on both their analytics tools and the underlying storage repository — and the latter is arguably the most important.
Why? Because the repository must ingest and serve data from many sources at the required performance levels, and it must be able to grow in both performance and capacity so that data remains broadly available to applications, tools and users.
In the search for greater scalability, flexibility and lower cost, object storage is quickly emerging as the storage standard for data lakes.
With object storage, there is no practical limit on the volume of data. Another key benefit is that it accommodates all types of data without predefined schemas — unlike an RDBMS, where the structure of tables and the relationships between them must be defined before complex queries can run. This schema-free model greatly increases flexibility.
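To make the contrast concrete, the object storage model can be pictured as a flat key/value namespace that accepts any payload under any key, with no table definition required up front. The sketch below is a minimal in-memory illustration only — the `ObjectStore` class and the keys are hypothetical, not Scality’s or the S3 API’s actual interface:

```python
import json

# Minimal in-memory sketch of the object storage model: a flat key/value
# namespace that accepts arbitrary bytes under arbitrary keys.
# No schema or table definition is declared before writing data.
class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Any payload is accepted as-is; optional metadata travels with it.
        self._objects[key] = {"data": data, "metadata": metadata or {}}

    def get(self, key):
        return self._objects[key]["data"]

store = ObjectStore()
# Structured, semi-structured, and unstructured data live side by side:
store.put("events/click.json", json.dumps({"user": 42}).encode(), {"type": "json"})
store.put("logs/2023/app.log", b"ERROR disk full", {"type": "text"})
store.put("images/scan.raw", bytes([0x89, 0x50, 0x4E, 0x47]), {"type": "binary"})
```

An RDBMS, by contrast, would require a `CREATE TABLE` statement (columns, types, constraints) before any of these rows could be inserted — and the binary image would not fit the relational model at all without extra work.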
In addition, modern object storage systems like Scality support independent scale-out of capacity and performance — a major bonus for large analytics projects. Being able to independently scale offers the right compute performance for data analysis — on demand — and substantially decreases the total cost of a data lake solution.
Object storage has also been embraced by application vendors in their quest to solve the challenge of ever-growing data capacities for customers. Splunk now supports object storage via its SmartStore interface (which leverages the Amazon S3 API), and Micro Focus Vertica provides Eon Mode (which also leverages S3).
These solutions decouple the compute (search) tier from the persistent capacity tier, giving users more flexibility and cost efficiency while enabling much higher data volumes, making analytics more effective. Furthermore, the Apache Spark ecosystem, which traditionally used HDFS for storage, is also compatible with S3 object storage through S3A, a Hadoop-compatible file system interface built on the S3 API.
Want to know more about the advantages of object storage for data lakes? Read my recent article for Data Center Dynamics.