In our article Not your typical Data Warehouse — a tech dive into modern data platforms, where we gave a high-level overview of the data platforms in constant use across our projects, we did not emphasize the advantages of any specific provider. The piece below will do some shuffling though, as we have recently moved toys from one shelf to another, or, in other words, we've helped a customer migrate from Dremio to Snowflake. Here's the story so far.
As much as we like Dremio & the open source community around it, we need to point out that the company advertises its platform as capable of performing well without the need for data movement — a great feature if you have data available in different formats and locations, right?
The thing is that, in our experience, this only holds for relatively small datasets. To work at scale, a somewhat bumpier road has to be taken, where you build pipelines that convert your data to Apache Iceberg. By no means a huge hindrance, but a slight hiccup, perchance.
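To give a feel for what that conversion step looks like, here is a minimal sketch of such a pipeline, assuming a Spark environment with the Iceberg runtime available. The catalog name, warehouse path, and table names are placeholders for illustration, not the customer's actual setup.

```python
# Minimal Parquet -> Apache Iceberg conversion job (PySpark).
# Catalog name, warehouse path and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-to-iceberg")
    # Assumes the Iceberg Spark runtime JAR is on the classpath.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/iceberg/")
    .getOrCreate()
)

# Read the raw files as they currently sit on object storage...
raw = spark.read.parquet("s3a://my-bucket/raw/events/")

# ...and rewrite them as an Iceberg table the query engine can work with at scale.
raw.writeTo("lake.analytics.events").using("iceberg").createOrReplace()
```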
In tech, once you've entered a bumpy road, it's hard to turn back, and for us the file conversion was only the threshold of more serious cases that started popping up.
We need to stress here that Dremio's team has always been open and very transparent in addressing all the issues we encountered. Still, after a couple of back-and-forths with management, the decision to migrate was made.
What happened underneath that led to it?
Dremio in its cloud version runs its engines on the cloud provider where you've deployed your resources. So far, Dremio can be deployed on Azure and AWS.
To query data, your engine obviously must be up & running. In our case these engines were backed by AWS EC2 instances.
Although AWS offers a wide range of EC2 instance types, as of early 2024 Dremio still lets you run only two of them. When you define your engines in the platform, you are essentially left with choosing the number of nodes built from those two instance types; there is no way to fine-tune it beyond that.
To put it simply, the 2XS engine offered in the platform runs on m5d.4xlarge instances, whereas all the bigger ones use m5d.8xlarge.
This setup can unfortunately lead to issues when Dremio is used in a region with limited capacity for those instance types, or when other customers are using them heavily.
For example, in AWS's North Virginia & London regions, where demand for these instances is exceptionally high, you may find yourself temporarily unable to query your data at all.
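A quick way to at least see where those instance types are offered is to ask AWS directly. The sketch below is illustrative: it lists the Availability Zones that offer the two types at all, which is not the same as having free capacity at launch time (shortages surface as InsufficientInstanceCapacity errors when the engine tries to start).

```python
# List which Availability Zones in a region offer the instance types Dremio
# engines rely on. Offering != free capacity right now; capacity shortages
# still show up as InsufficientInstanceCapacity errors at launch time.
import boto3

DREMIO_ENGINE_TYPES = ["m5d.4xlarge", "m5d.8xlarge"]

ec2 = boto3.client("ec2", region_name="us-east-1")  # North Virginia
resp = ec2.describe_instance_type_offerings(
    LocationType="availability-zone",
    Filters=[{"Name": "instance-type", "Values": DREMIO_ENGINE_TYPES}],
)

for offering in resp["InstanceTypeOfferings"]:
    print(f'{offering["InstanceType"]} is offered in {offering["Location"]}')
```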
Moreover, we’ve seen situations where the metadata refresh process was taking several hours to complete without the option to cancel it.
If you have a heavy load of files on your object storage (say, millions per partition) and you use commands like INSERT INTO, Dremio can trigger a metadata refresh that rolls like Leonard Cohen's "Avalanche". And good luck stopping the process.
One of the general features in Dremio that teams use often is the option to export particular data chunks to CSV for further analysis. Alas, this one can sometimes underperform when working on large datasets; the queries tend to time out, preventing users from downloading the data.
Our guess is that cookies expire (or some other timeout kicks in), leaving the end user in limbo with a logged-out session. In another case, you could end up downloading a file that turned out to be an HTML error document instead of the desired CSV. From an analytical standpoint, this gets really frustrating.
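If you've been bitten by this, a tiny sanity check on the downloaded file saves a lot of confusion downstream. This is just a defensive sketch; the file name is an example.

```python
# Verify that an exported "CSV" is not actually an HTML error page returned
# by a timed-out or logged-out session. The file path is an example.
from pathlib import Path

def looks_like_html(path: str) -> bool:
    head = Path(path).read_text(errors="replace")[:512].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

export = "exported_chunk.csv"
if looks_like_html(export):
    raise RuntimeError(f"{export} is an HTML error document, not CSV; re-run the export")
```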
A couple of stakeholder meetings & a few consultations later, we figured that Dremio might not be enough to handle the customer's huge datasets without losing the edge of its 'no data movement' policy.
And since going on-prem, which could potentially have solved all of these issues, was not an option, we had to move on.
Our consultative captain obvious had a really ̶t̶o̶u̶g̶h̶ ̶n̶u̶t̶ ̶t̶o̶ ̶c̶r̶a̶c̶k̶ easy choice when deciding where to go next, especially when our customer’s infrastructure was sitting on AWS. Thus, the omnipresent Snowflake entered the game…& pretty much nailed it.
Why?
Snowflake stands out among data platforms for its full elasticity, and when we write full, we mean both vertical and horizontal scalability with on-demand compute resources.
From the perspective of dashboards, reports, and data applications, this automatic scaling adjusts seamlessly to demand and keeps performance high even when user concurrency spikes. Compute resources are adjusted on the go, which gives plenty of flexibility for daily data analysis tasks that like to skyrocket from time to time.
For scenarios requiring more significant throughput, Snowflake allows you to scale vertically without downtime. This comes in handy when you need literally zero interruption, allowing data scientists, analysts, and complex workloads to finish in minimal time.
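As an illustration, here is a hedged sketch of such a resize using the Snowflake Python connector. The account, credentials, and warehouse name are placeholders; queries already running keep their current resources, while new queries pick up the new size.

```python
# Zero-downtime vertical resize of a Snowflake virtual warehouse.
# Account, credentials, and warehouse name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
)
cur = conn.cursor()

# Scale up before a heavy batch of data-science queries...
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")

# ...and back down once it is done, to keep costs in check.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'SMALL'")

conn.close()
```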
What's worth mentioning is that even though vertical scaling lets you add more compute resources, ensuring faster results for complex tasks, it's designed to keep compute costs roughly linear: doubling a warehouse's size doubles the per-second rate but, for well-parallelised workloads, roughly halves the runtime, so the total bill for the job stays about the same. A common use case is your data science team getting answers more quickly without a significant increase in cost.
To boost performance even more, Snowflake allows you to allocate specific compute resources to different departments, workloads, or applications, avoiding resource contention and reducing the need for separate platforms.
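In practice this often comes down to a handful of CREATE WAREHOUSE statements, one per team or workload. The sketch below is illustrative only: names, sizes, and the multi-cluster settings (which require an appropriate Snowflake edition) are assumptions, not a prescription.

```python
# One warehouse per team/workload so they never fight over compute.
# Names, sizes, and multi-cluster settings are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

for stmt in [
    # Dashboards & BI: scale out horizontally when concurrency spikes.
    """CREATE WAREHOUSE IF NOT EXISTS BI_WH
         WAREHOUSE_SIZE = 'MEDIUM' MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4
         AUTO_SUSPEND = 60 AUTO_RESUME = TRUE""",
    # Data science: a bigger single cluster that suspends itself when idle.
    """CREATE WAREHOUSE IF NOT EXISTS DATA_SCIENCE_WH
         WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE""",
    # ELT pipelines get their own compute as well.
    """CREATE WAREHOUSE IF NOT EXISTS ELT_WH
         WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE""",
]:
    cur.execute(stmt)

conn.close()
```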
The three points above are enough to rest assured that, no matter how much the data grows, it will be processed and analyzed without hiccups, and going from 1 GB to 1 PB requires no resizing or additional maintenance.
Hard not to dig Snowflake in the context of performance.
On top of that, we have…
When data is brought into Snowflake, the platform automatically compresses and encrypts it. Snowflake can handle any type of data and transforms it into its proprietary format, called FDN. On average it achieves a compression ratio of about 3–10x, balancing storage efficiency and performance: a conservative estimate is 3–5x, while higher compression rates can reach 8–10x.
As mentioned, all data is stored in Snowflake's proprietary FDN format, ensuring consistent performance no matter the data type being loaded. On most platforms we know, compression differs from format to format and you have to manage it case by case on your own.
The general reduction of data storage cost is a sweet bonus on top of the stable performance.
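To make the storage side tangible, here is some back-of-the-envelope arithmetic. The per-terabyte price below is purely an assumption for illustration; check current Snowflake pricing for your region and plan.

```python
# Back-of-the-envelope storage maths at different FDN compression ratios.
# The per-TB monthly price is an assumed figure for illustration only.
raw_tb = 100                    # raw data volume landed in Snowflake, in TB
price_per_tb_month = 23.0       # assumed USD per TB per month

for ratio in (3, 5, 10):        # conservative to optimistic compression
    stored_tb = raw_tb / ratio
    print(f"{ratio}x compression: ~{stored_tb:.1f} TB stored, "
          f"~${stored_tb * price_per_tb_month:,.0f}/month")
```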
All in all, Snowflake delivers as promised, i.e. allows you to grow and adapt without costly infrastructure changes or downtime.
For us, performance was all & Snowflake really has that part covered. But, before we jump to conclusions that already loom on the horizon, let’s take a swift look at some other nuggets that Snowflake has to offer.
Snowflake offers a range of governance features designed to underpin data strategies in organizations. These include Secure Views, Dynamic Data Masking, Row-Level Security, Anonymized Views, Object Tagging, automatic detection of sensitive data, and a detailed Access History to track platform activity. It's also the only platform we've worked with that lets admins see exactly which columns and tables are accessed by which users, which makes it easy to identify who queries specific PII data.
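For the column-level visibility specifically, a query against the ACCESS_HISTORY view usually answers the "who touched this PII?" question. Below is a hedged sketch: it assumes the appropriate Snowflake edition and privileges on the SNOWFLAKE database, the connection details are placeholders, and the view's contents can lag behind real time.

```python
# Who accessed what over the last week, via ACCOUNT_USAGE.ACCESS_HISTORY.
# Requires suitable privileges on the SNOWFLAKE database; connection details
# are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
cur.execute("""
    SELECT user_name,
           query_start_time,
           direct_objects_accessed
    FROM snowflake.account_usage.access_history
    WHERE query_start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY query_start_time DESC
    LIMIT 100
""")
for user_name, started_at, objects in cur:
    # objects is a JSON array of the tables and columns the query touched
    print(user_name, started_at, objects)

conn.close()
```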
Will you need all the features listed? Most probably not, but think about highly regulated industries where SOC 2 Type II is child's play, and then it starts making sense, right?
Snowflake is designed with enterprise-level security embedded into its platform, ensuring robust protection without requiring additional adjustments. All data within Snowflake is encrypted, both when it’s stored and when it’s being transmitted, and there’s no option to disable this encryption.
Additionally, Snowflake has a built-in mechanism for automatic key rotation, which keeps the data secure at all times. This process is fully managed by the Snowflake system, providing continuous security without user intervention.
Since we've already mentioned SOC 2 Type II: that's your auditor satisfied.
What we particularly like about Snowflake is how manageable your stuff is under its hood.
Snowflake automatically partitions your data upon ingestion and optimizes it for efficient query performance, without requiring manual management of file sizes, folder structure, or format. It just works, and you don't have to worry about data organization, as Snowflake handles it seamlessly behind the scenes. Dremio, on the other hand, has a more 'DIY folks' approach, which is fair enough, but when the partitioning structure doesn't align with Dremio's recommended layout, query performance suffers. This results in increased resource consumption and additional cost for every query due to inefficient use of compute. Proper data partitioning is key to maintaining efficiency and minimizing overhead when using Dremio.
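For contrast, this is roughly what the Snowflake side of that story looks like: create the table, load it, and let micro-partitioning happen on its own. The database, stage, table, and column names are all illustrative assumptions, not the customer's schema.

```python
# No partition layout to design on the Snowflake side: create the table,
# bulk-load from object storage, and micro-partitioning is handled for you.
# Database, stage, table, and column names are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS analytics.public.events (
        event_id NUMBER,
        event_ts TIMESTAMP_NTZ,
        payload  VARIANT
    )
""")

# Load straight from a stage pointing at object storage; no folder structure
# or file-size tuning to get "right" beforehand.
cur.execute("""
    COPY INTO analytics.public.events
    FROM @analytics.public.raw_stage/events/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

conn.close()
```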
We've written it once & will repeat it again: Dremio is a great data platform & had been meeting our consultants' expectations for a long time before the decision to move on was made. There are projects in which we still use it and do not plan any changes. It's just that, in certain contexts, migration to Snowflake is a natural consequence of business expansion. Bigger needs require platforms with slightly broader shoulders, and we are here to help with the wheelbarrows.
Below is a high-level overview of how we've prepared for a seamless transition between this article's protagonists, in the form of clean checks.
Based on the migration process and further usage, we are going to follow up with a more specific description of how Snowflake works in action, in the same environment & under the same business conditions presented above in the context of Dremio.
Drop us a line at contact@insightify.io & we’ll be back right after.
At insightify.io we help organisations migrate: whether you're considering a move between on-premises and cloud solutions, or thinking about a shift between data platforms, we are here to back you up.