In our article Not your typical Data Warehouse — a tech dive into modern data platforms, where we gave a high-level overview of the data platforms in constant use across our projects, we did not emphasize the advantages of any specific provider. The piece below will do some shuffling though, as we have recently moved toys from one shelf to another, or, in other words, we've helped a customer migrate from Dremio to Snowflake. Here's the story so far.
As much as we like Dremio & the open source community around it, we need to point out that the company advertises its platform as capable of performing well without the need for data movement — a great feature if you have data available in different formats and locations, right?
The thing is that, in our experience, this only holds for relatively small datasets. To work at scale, a somewhat bumpier road has to be taken, where you build pipelines that convert your data to Apache Iceberg. By no means a huge hindrance, but a slight hiccup, perchance.
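To give a feel for what that conversion step looks like, here is a minimal sketch of such a pipeline, assuming a Spark environment with the Iceberg runtime available. The catalog name, warehouse path, and table names are placeholders for illustration, not the customer's actual setup.

```python
# Minimal Parquet -> Apache Iceberg conversion job (PySpark).
# Catalog name, warehouse path and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-to-iceberg")
    # Assumes the Iceberg Spark runtime JAR is on the classpath.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/iceberg/")
    .getOrCreate()
)

# Read the raw files as they currently sit on object storage...
raw = spark.read.parquet("s3a://my-bucket/raw/events/")

# ...and rewrite them as an Iceberg table the query engine can work with at scale.
raw.writeTo("lake.analytics.events").using("iceberg").createOrReplace()
```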
In tech, once you've entered a bumpy road, it's hard to turn back, and for us the file conversion was only the threshold of more serious cases that started popping up.
We need to stress here that Dremio's team has always been open and very transparent in addressing all the issues we encountered. Still, after a couple of back-and-forths with management, the decision to migrate was made.
What happened underneath that led to it?
Dremio in its cloud version runs its engines on the cloud provider where you've deployed your resources. So far, Dremio can be deployed on Azure and AWS.
To query data, your engine obviously must be up & running. In our case these engines were backed by AWS EC2 instances.
Although AWS offers a wide range of EC2 instance types, as of early 2024 Dremio still lets you run only two of them. When you define your engines in the platform, you are essentially left with choosing the number of nodes built from those two instance types; there is no way to fine-tune it beyond that.
To put it simply, the 2XS engine offered in the platform runs on m5d.4xlarge instances, whereas all the bigger ones use m5d.8xlarge.
This setup can unfortunately lead to issues when Dremio is used in a region with limited capacity for those instance types, or when other customers are using them heavily.
For example, in AWS's North Virginia & London regions, where demand for these instances is exceptionally high, you may find yourself temporarily unable to query your data at all.
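A quick way to at least see where those instance types are offered is to ask AWS directly. The sketch below is illustrative: it lists the Availability Zones that offer the two types at all, which is not the same as having free capacity at launch time (shortages surface as InsufficientInstanceCapacity errors when the engine tries to start).

```python
# List which Availability Zones in a region offer the instance types Dremio
# engines rely on. Offering != free capacity right now; capacity shortages
# still show up as InsufficientInstanceCapacity errors at launch time.
import boto3

DREMIO_ENGINE_TYPES = ["m5d.4xlarge", "m5d.8xlarge"]

ec2 = boto3.client("ec2", region_name="us-east-1")  # North Virginia
resp = ec2.describe_instance_type_offerings(
    LocationType="availability-zone",
    Filters=[{"Name": "instance-type", "Values": DREMIO_ENGINE_TYPES}],
)

for offering in resp["InstanceTypeOfferings"]:
    print(f'{offering["InstanceType"]} is offered in {offering["Location"]}')
```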
Moreover, we’ve seen situations where the metadata refresh process was taking several hours to complete without the option to cancel it.
If you have a heavy load of files on your object storage (say, millions per partition) and you use commands like INSERT INTO, Dremio can trigger a metadata refresh that rolls like Leonard Cohen's "Avalanche". And good luck stopping the process.
One of the general features in Dremio that teams use often is the option to export particular data chunks to CSV for further analysis. Alas, this one can sometimes underperform when working on large datasets; the queries tend to time out, preventing users from downloading the data.
Our guess is that cookies expire (or some other timeout kicks in), leaving the end user in limbo with a logged-out session. In another case, you could end up downloading a file that turned out to be an HTML error document instead of the desired CSV. From an analytical standpoint, this gets really frustrating.
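If you've been bitten by this, a tiny sanity check on the downloaded file saves a lot of confusion downstream. This is just a defensive sketch; the file name is an example.

```python
# Verify that an exported "CSV" is not actually an HTML error page returned
# by a timed-out or logged-out session. The file path is an example.
from pathlib import Path

def looks_like_html(path: str) -> bool:
    head = Path(path).read_text(errors="replace")[:512].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

export = "exported_chunk.csv"
if looks_like_html(export):
    raise RuntimeError(f"{export} is an HTML error document, not CSV; re-run the export")
```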
A couple of stakeholder meetings & a few consultations later, we figured that Dremio might not be enough to handle the customer's huge datasets without losing the edge of its 'no data movement' policy.
And since going on-prem, which could potentially have solved all of these issues, was not an option, we had to move on.
Our consultative captain obvious had a really ̶t̶o̶u̶g̶h̶ ̶n̶u̶t̶ ̶t̶o̶ ̶c̶r̶a̶c̶k̶ easy choice when deciding where to go next, especially when our customer’s infrastructure was sitting on AWS. Thus, the omnipresent Snowflake entered the game…& pretty much nailed it.
Why?
Snowflake stands out among data platforms for its full elasticity, and when we write full, we mean both vertical and horizontal scalability with on-demand compute resources.
From the perspective of dashboards, reports, and data applications, this automatic scaling adjusts seamlessly to demand and keeps performance high even when user concurrency spikes. Compute resources are adjusted on the go, which gives plenty of flexibility for daily data analysis tasks that like to skyrocket from time to time.
For scenarios requiring more significant throughput, Snowflake allows you to scale vertically without downtime. This comes in handy when you need literally zero interruption, allowing data scientists, analysts, and complex workloads to finish in minimal time.
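As an illustration, here is a hedged sketch of such a resize using the Snowflake Python connector. The account, credentials, and warehouse name are placeholders; queries already running keep their current resources, while new queries pick up the new size.

```python
# Zero-downtime vertical resize of a Snowflake virtual warehouse.
# Account, credentials, and warehouse name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
)
cur = conn.cursor()

# Scale up before a heavy batch of data-science queries...
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")

# ...and back down once it is done, to keep costs in check.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'SMALL'")

conn.close()
```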
What's worth mentioning is that even though vertical scaling lets you add more compute resources, ensuring faster results for complex tasks, it's designed to keep compute costs roughly linear: doubling a warehouse's size doubles the per-second rate but, for well-parallelised workloads, roughly halves the runtime, so the total bill for the job stays about the same. A common use case is your data science team getting answers more quickly without a significant increase in cost.
To boost performance even more, Snowflake allows you to allocate specific compute resources to different departments, workloads, or applications, avoiding resource contention and reducing the need for separate platforms.
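In practice this often comes down to a handful of CREATE WAREHOUSE statements, one per team or workload. The sketch below is illustrative only: names, sizes, and the multi-cluster settings (which require an appropriate Snowflake edition) are assumptions, not a prescription.

```python
# One warehouse per team/workload so they never fight over compute.
# Names, sizes, and multi-cluster settings are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

for stmt in [
    # Dashboards & BI: scale out horizontally when concurrency spikes.
    """CREATE WAREHOUSE IF NOT EXISTS BI_WH
         WAREHOUSE_SIZE = 'MEDIUM' MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4
         AUTO_SUSPEND = 60 AUTO_RESUME = TRUE""",
    # Data science: a bigger single cluster that suspends itself when idle.
    """CREATE WAREHOUSE IF NOT EXISTS DATA_SCIENCE_WH
         WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE""",
    # ELT pipelines get their own compute as well.
    """CREATE WAREHOUSE IF NOT EXISTS ELT_WH
         WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE""",
]:
    cur.execute(stmt)

conn.close()
```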
The three points above are enough to rest assured that, no matter how much the data grows, it will be processed and analyzed without hiccups, and going from 1 GB to 1 PB requires no resizing or additional maintenance.
Hard not to dig Snowflake in the context of performance.
On top of that, we have…
When data is brought into Snowflake, the platform automatically compresses and encrypts it. Snowflake can handle any type of data and transforms it into its proprietary format, called FDN. On average it achieves a compression ratio of about 3–10x, balancing storage efficiency and performance: a conservative estimate is 3–5x, while higher compression rates can reach 8–10x.
As mentioned, all data is stored in Snowflake's proprietary FDN format, ensuring consistent performance no matter the data type being loaded. On most platforms we know, compression differs from format to format and you have to manage it case by case on your own.
The general reduction of data storage cost is a sweet bonus on top of the stable performance.
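To make the storage side tangible, here is some back-of-the-envelope arithmetic. The per-terabyte price below is purely an assumption for illustration; check current Snowflake pricing for your region and plan.

```python
# Back-of-the-envelope storage maths at different FDN compression ratios.
# The per-TB monthly price is an assumed figure for illustration only.
raw_tb = 100                    # raw data volume landed in Snowflake, in TB
price_per_tb_month = 23.0       # assumed USD per TB per month

for ratio in (3, 5, 10):        # conservative to optimistic compression
    stored_tb = raw_tb / ratio
    print(f"{ratio}x compression: ~{stored_tb:.1f} TB stored, "
          f"~${stored_tb * price_per_tb_month:,.0f}/month")
```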
All in all, Snowflake delivers as promised, i.e. allows you to grow and adapt without costly infrastructure changes or downtime.
For us, performance was all & Snowflake really has that part covered. But, before we jump to conclusions that already loom on the horizon, let’s take a swift look at some other nuggets that Snowflake has to offer.
Snowflake offers a range of governance features designed to underpin data strategies in organizations. These include Secure Views, Dynamic Data Masking, Row-Level Security, Anonymized Views, Object Tagging, automatic detection of sensitive data, and a detailed Access History to track platform activity. It's also the only platform we've worked with that lets admins see exactly which columns and tables are accessed by which users, which makes it easy to identify who queries specific PII data.
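For the column-level visibility specifically, a query against the ACCESS_HISTORY view usually answers the "who touched this PII?" question. Below is a hedged sketch: it assumes the appropriate Snowflake edition and privileges on the SNOWFLAKE database, the connection details are placeholders, and the view's contents can lag behind real time.

```python
# Who accessed what over the last week, via ACCOUNT_USAGE.ACCESS_HISTORY.
# Requires suitable privileges on the SNOWFLAKE database; connection details
# are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
cur.execute("""
    SELECT user_name,
           query_start_time,
           direct_objects_accessed
    FROM snowflake.account_usage.access_history
    WHERE query_start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY query_start_time DESC
    LIMIT 100
""")
for user_name, started_at, objects in cur:
    # objects is a JSON array of the tables and columns the query touched
    print(user_name, started_at, objects)

conn.close()
```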
Will you need all the features listed? Most probably not, but think about highly regulated industries where SOC 2 Type II is child's play, and then it starts making sense, right?
Snowflake is designed with enterprise-level security embedded into its platform, ensuring robust protection without requiring additional adjustments. All data within Snowflake is encrypted, both when it’s stored and when it’s being transmitted, and there’s no option to disable this encryption.
Additionally, Snowflake has a built-in mechanism for automatic key rotation, which keeps the data secure at all times. This process is fully managed by the Snowflake system, providing continuous security without user intervention.
Since we've already mentioned SOC 2 Type II: that's your auditor satisfied.
What we particularly like about Snowflake is how manageable your stuff is under its hood.
Snowflake automatically partitions your data upon ingestion and optimizes it for efficient query performance, without requiring manual management of file sizes, folder structure, or format. It just works, and you don't have to worry about data organization, as Snowflake handles it seamlessly behind the scenes. Dremio, on the other hand, has a more 'DIY folks' approach, which is fair enough, but when the partitioning structure doesn't align with Dremio's recommended layout, query performance suffers. This results in increased resource consumption and additional cost for every query due to inefficient use of compute. Proper data partitioning is key to maintaining efficiency and minimizing overhead when using Dremio.
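For contrast, this is roughly what the Snowflake side of that story looks like: create the table, load it, and let micro-partitioning happen on its own. The database, stage, table, and column names are all illustrative assumptions, not the customer's schema.

```python
# No partition layout to design on the Snowflake side: create the table,
# bulk-load from object storage, and micro-partitioning is handled for you.
# Database, stage, table, and column names are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS analytics.public.events (
        event_id NUMBER,
        event_ts TIMESTAMP_NTZ,
        payload  VARIANT
    )
""")

# Load straight from a stage pointing at object storage; no folder structure
# or file-size tuning to get "right" beforehand.
cur.execute("""
    COPY INTO analytics.public.events
    FROM @analytics.public.raw_stage/events/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

conn.close()
```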
We've written it once & will repeat it again: Dremio is a great data platform & had been meeting our consultants' expectations for a long time before the decision to move on was made. There are projects in which we still use it and do not plan any changes. It's just that, in certain contexts, migration to Snowflake is a natural consequence of business expansion. Bigger needs require platforms with slightly broader shoulders, and we are here to help with the wheelbarrows.
Below is a high-level overview of how we've prepared for a seamless transition between this article's protagonists, in the form of clean checks.
Based on the migration process and further usage, we are going to follow up with a more specific description of how Snowflake works in action, in the same environment & under the same business conditions presented above in the context of Dremio.
Drop us a line at contact@insightify.io & we’ll be back right after.
At insightify.io we help organisations migrate: whether you're considering a move between on-premises and cloud solutions, or thinking about a shift between data platforms, we are here to back you up.