By Paul Speciale, chief product officer, Scality
The term “data lake” — a centralized repository that holds vast amounts of raw data in its native format — has only been around for about a decade. Despite the term’s relative newness, the data lake market is expected to reach an annual volume of $20.1 billion by 2025, according to Research and Markets.
Usually, a data lake houses data from many sources in multiple formats — all of which requires analysis in order to yield business insights. Increasingly, we hear “data lake” and “big data” mentioned in the same breath. And that makes sense, because big data analytics requires a massive trove of data to derive insights from.
The need for flexible, scalable management of all data formats
Because data lakes aggregate data from various sources, they can quickly reach petabyte scale and beyond. This data volume exceeds the capacity of traditional database technologies, such as relational database management systems (RDBMS), which were primarily designed to handle structured data.
Not only is there a potential capacity issue, but data lakes amass structured, semi-structured, and unstructured data. To manage these different data types flexibly and at scale, newer storage systems such as the Hadoop Distributed File System (HDFS) have been adopted as data lake storage solutions. But, like any technology, HDFS has its limitations.
A major downside to HDFS is that its compute and storage resources are tightly coupled as it scales, because the file system is hosted on the same machines as the application. Storage capacity and compute grow in lockstep, so adding capacity means paying for processing power you may not need — which can end up being quite expensive.
Modern object storage offers fundamental advantages for data lakes
To fully reap the business insights that lie in these massive data lakes, organizations depend on both their analytics tools and the underlying storage repository — and the latter is arguably the most important.
Why? Because the repository must ingest and serve data from many sources at the required performance levels, and it must be able to grow in both performance and capacity so that data remains broadly available to applications, tools and users.
In the search for greater scalability, flexibility and lower cost, object storage is quickly emerging as the storage standard for data lakes.
With object storage, there is no practical limit on the volume of data. Another key benefit is that it accommodates all types of data without predefined schemas — unlike an RDBMS, where the structure of tables and the relationships between them must be defined before complex queries can run. This schema-free model greatly increases flexibility.
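To make the contrast concrete, the object storage model can be pictured as a flat key/value namespace that accepts any payload under any key, with no table definition required up front. The sketch below is a minimal in-memory illustration only — the `ObjectStore` class and the keys are hypothetical, not Scality’s or the S3 API’s actual interface:

```python
import json

# Minimal in-memory sketch of the object storage model: a flat key/value
# namespace that accepts arbitrary bytes under arbitrary keys.
# No schema or table definition is declared before writing data.
class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Any payload is accepted as-is; optional metadata travels with it.
        self._objects[key] = {"data": data, "metadata": metadata or {}}

    def get(self, key):
        return self._objects[key]["data"]

store = ObjectStore()
# Structured, semi-structured, and unstructured data live side by side:
store.put("events/click.json", json.dumps({"user": 42}).encode(), {"type": "json"})
store.put("logs/2023/app.log", b"ERROR disk full", {"type": "text"})
store.put("images/scan.raw", bytes([0x89, 0x50, 0x4E, 0x47]), {"type": "binary"})
```

An RDBMS, by contrast, would require a `CREATE TABLE` statement (columns, types, constraints) before any of these rows could be inserted — and the binary image would not fit the relational model at all without extra work.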
In addition, modern object storage systems like Scality support independent scale-out of capacity and performance — a major bonus for large analytics projects. Being able to independently scale offers the right compute performance for data analysis — on demand — and substantially decreases the total cost of a data lake solution.
Object storage has also been embraced by application vendors in their quest to solve the challenge of ever-growing data capacities for customers. Splunk now supports object storage via its SmartStore interface (which leverages the Amazon S3 API), and Micro Focus Vertica provides Eon Mode (which also leverages S3).
These solutions decouple the compute (search) tier from the persistent capacity tier, giving users more flexibility and cost efficiency while enabling much higher data volumes, making analytics more effective. Furthermore, the Apache Spark ecosystem, which traditionally used HDFS for storage, is also compatible with S3 object storage through S3A, a Hadoop-compatible file system interface built on the S3 API.
Want to know more about the advantages of object storage for data lakes? Read my recent article for Data Center Dynamics.