By Pierre Gueant
Data warehousing is the process of collecting, storing, and managing large amounts of data from various sources in a centralized repository. A data warehouse is designed to support business intelligence (BI) activities, such as reporting, data analysis, and data mining, by providing a unified view of an organization’s data.
- The first step in data warehousing involves extracting data from multiple sources: relational databases, spreadsheets, CRM applications, machine-generated data, social media, and web data.
- Data is then transformed into a standardized format and loaded into the data warehouse. Popular formats include CSV, JSON, Parquet, ORC, and Avro.
- Once the data is in the warehouse, it can be used for various BI activities to gain insights into an organization’s operations, customers, performance, and other key business metrics.
Data warehousing is commonly used by large enterprises. It provides a structured and efficient way to manage and process large volumes of data, enabling organizations to make informed, data-driven business decisions.
What Snowflake brings to the table
Snowflake is a cloud-based data warehousing company that provides a platform for storing, managing, and analyzing large amounts of structured and semi-structured data.
Snowflake’s key product characteristics include its scalability, security, and ease of use. The platform is designed to handle massive amounts of data, with the ability to scale up or down as needed, making it an ideal solution for organizations with fluctuating data needs.
Additionally, Snowflake places a strong emphasis on security, with features such as encryption and multi-factor authentication to ensure data is protected. Finally, the platform is designed to be easy to use, with a user-friendly interface and the ability to integrate with a variety of popular business intelligence and analytics tools.
Snowflake’s storage capacity can range from a few terabytes to multiple petabytes, depending on the user’s requirements and usage patterns.
Coupling Snowflake with on-premises object storage
While the public cloud is an option for many enterprises, more and more organizations are looking to repatriate or keep data and workloads on-premises for cost, security, data sovereignty, and other reasons. By integrating Snowflake’s powerful data analytics solution with on-premises object storage, organizations get the benefit of real-time business insights while maintaining complete control over where their data is stored.
Using Snowflake with on-premises Amazon S3-compatible object storage
The Amazon S3 API has become an industry standard for object storage. Large enterprises with specific security, privacy, and regulatory requirements may choose to store all or part of their analytics data on-premises. On-premises S3-compatible object storage, such as Scality RING and Scality ARTESCA, is a natural choice because these solutions bring high data resiliency and high-performance access to the data at an affordable cost per terabyte.
Snowflake runs on public cloud infrastructure, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). While the Snowflake architecture is not designed to run on-premises or in a private cloud environment, Snowflake supports the S3 API. As a result, Snowflake can integrate with Scality’s on-premises S3-compatible storage.
It is possible to create external stages (named references to the locations where data files are stored) and external tables against RING and ARTESCA. This extends the features that work with external stages and external tables to data outside of the public cloud, whether on-premises or at the edge.
Organizations can establish a secure connection between their on-premises Scality storage and their Snowflake account, allowing them to load data into Snowflake using their existing data pipelines.
This integration enables customers to take advantage of Snowflake’s cloud data warehousing capabilities, while still keeping their data within their own environment, providing them with greater control and flexibility over their data management.
How to connect Snowflake to your on-premises Scality RING or Scality ARTESCA
To connect Snowflake with Scality RING or ARTESCA, the endpoint must be:
- Accessible from Snowflake compute in the public cloud where you run Snowflake.
- Set to use direct credentials.
- Set to use HTTPS communication with a valid SSL certificate.
Setting up S3-compatible storage
To use S3-compatible storage with Snowflake, you must first create an external stage that points to the storage device.
create or replace stage my_s3compat_stage
  url='s3compat://my_bucket/my_files/'
  endpoint='s3.storage.com'
  credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z');
S3-compatible storage introduces two new additions to the create external stage syntax:
- The s3compat URL prefix: this signals to Snowflake that it is connecting to a device with an S3-compliant API.
- The endpoint parameter: this is the fully qualified domain name that points to the S3 API endpoint.
Note that only direct credentials are supported with S3-compatible storage using external stages.
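Once the stage exists, a quick check can confirm that Snowflake is able to reach the on-premises endpoint and see the files behind it. The following is a minimal sketch that reuses the my_s3compat_stage name from the example above; the Parquet pattern is only an illustration.

```sql
-- Show the stage's properties as Snowflake recorded them.
desc stage my_s3compat_stage;

-- List the files Snowflake can see under the stage location.
-- A failure here usually points to a network, certificate, or credential problem.
list @my_s3compat_stage;

-- Optionally narrow the listing with a regular expression (illustrative pattern).
list @my_s3compat_stage pattern = '.*[.]parquet';
```

If the listing returns the expected objects, the endpoint, bucket path, and credentials are working together as intended.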
Using S3-compatible storage
S3-compatible stages offer the same functionality as external stages. This means you can copy data in and out of the stage:
copy into my_table from @my_s3compat_stage;
copy into @my_s3compat_stage/path from my_table;
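When no file format is given, COPY falls back to the stage's file format or to CSV, so it is often worth stating the format explicitly. The sketch below assumes the staged files are Parquet and that the column names in my_table match the Parquet field names; the export/ prefix is hypothetical.

```sql
-- Load Parquet files, mapping Parquet fields to table columns by name.
copy into my_table
  from @my_s3compat_stage
  pattern = '.*[.]parquet'
  file_format = (type = parquet)
  match_by_column_name = case_insensitive;

-- Unload the table back to the stage as Parquet under a hypothetical export/ prefix.
copy into @my_s3compat_stage/export/
  from my_table
  file_format = (type = parquet);
```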
You can also create external tables on S3-compatible storage for performing analytical queries on the data. External table features, such as data sharing, work with external tables created on S3-compatible stages.
create or replace external table my_ext_table
  with location = @my_s3compat_stage/files/
  auto_refresh = false
  refresh_on_create = true
  file_format = (type = parquet)
  pattern='.*sales.*[.]parquet';
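An external table defined this way can be queried like any other table. As a rough sketch, the query below reads from my_ext_table through its VALUE column and the METADATA$FILENAME pseudocolumn; the region and amount fields are hypothetical and stand in for whatever fields the Parquet files actually contain.

```sql
-- VALUE is a variant column holding one record per row;
-- METADATA$FILENAME gives the path of the file each row came from.
select
  metadata$filename          as source_file,
  value:region::varchar      as region,   -- hypothetical Parquet field
  value:amount::number(12,2) as amount    -- hypothetical Parquet field
from my_ext_table
limit 10;
```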
External tables for S3-compatible storage have the following limitations:
- Auto-refresh is not supported, so the external table metadata must be refreshed manually.
- Query performance will vary depending on networking and device performance.
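Because auto-refresh is unavailable, new or removed files only become visible after the external table metadata is refreshed. A minimal sketch, using the my_ext_table name from the example above and a hypothetical subpath:

```sql
-- Re-scan the table location and register new, changed, or removed files.
alter external table my_ext_table refresh;

-- Refresh only one subpath of the table location (the path is hypothetical).
alter external table my_ext_table refresh '2023/06/';
```

One option is to run the refresh as the last step of whichever pipeline writes files to the bucket, so queries see the data soon after it lands.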
Improved data warehousing with increased security, decreased costs
Combining on-premises Scality object storage with Snowflake provides the best of both worlds: modern data warehousing capabilities together with robust privacy compliance, security, and cost management. This hybrid approach allows large companies to keep sensitive data on-premises while leveraging Snowflake’s cloud-native platform for data warehousing, delivering a powerful and flexible solution for modern data-intensive workloads.