Not your typical Data Warehouse — a tech dive into modern data platforms
Introduction
This is the first in a series of articles from Insightify.io about modern data platforms. We are kicking off with a high-level overview of three tools of the trade we use on a daily basis. Treat it as a starter, as we are cooking some deep-dive articles into the more detailed features & the tech behind those tools.
Data warehouses and data lakehouses are crucial components of modern data
management, providing a
centralised repository for storing and managing large volumes of structured
and unstructured data. In
this article, let us take a look at BigQuery and Snowflake, which are not
your typical cloud-based
data warehouses, and Dremio — a platform that facilitates access to data
from various sources without
the need for data movement, distinguishing it from traditional warehouse
solutions.
Why these three?
BigQuery is arguably the most mature and performant serverless data warehouse solution among those offered by the major cloud providers.
Snowflake is cloud-agnostic, scales easily, and best suits companies with a multi-cloud architecture.
Dremio is the only open-source representative here, offering similar cloud-agnostic capabilities but without vendor lock-in.
Overview
BigQuery, Google Cloud’s fully managed, serverless data warehouse, offers a
wide range of capabilities
for storing, querying, and analysing large datasets. Some of the key
possibilities and features
include:
Scalability: BigQuery is designed to handle petabytes
of data, making it suitable
for organisations of any size. It automatically scales to accommodate
growing data volumes without
the need for manual provisioning or management.
Serverless: With BigQuery, users can focus on analysing
data rather than managing
infrastructure. It eliminates the need for managing servers, ensuring
high availability and
reliability without the overhead of infrastructure maintenance.
Real-time analytics: BigQuery enables real-time analytics with features like streaming inserts and continuous queries. This allows organisations to analyse and derive insights from streaming data as it arrives.
Standard SQL: BigQuery supports standard SQL for
querying data, making it
familiar and accessible to SQL users. It also offers advanced SQL
functionality for complex
analytics tasks.
Machine Learning Integration: BigQuery ML allows users to build and deploy machine learning models directly within BigQuery using SQL queries (see the sketch after this list). This simplifies the process of incorporating machine learning into data analysis workflows.
Advanced Analytics: BigQuery offers advanced analytics
capabilities, including
window functions, arrays, and user-defined functions, enabling complex
data manipulation and
analysis.
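To make the BigQuery ML point above concrete, here is a minimal sketch of training and scoring a model purely in SQL; the dataset, table, and column names (demo_dataset.trips, duration_min, etc.) are hypothetical:

```sql
-- Train a linear regression model directly in BigQuery (BigQuery ML)
CREATE OR REPLACE MODEL demo_dataset.trip_duration_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_min']) AS
SELECT distance_km, hour_of_day, duration_min
FROM demo_dataset.trips;

-- Score new rows with the trained model
SELECT *
FROM ML.PREDICT(MODEL demo_dataset.trip_duration_model,
                (SELECT distance_km, hour_of_day FROM demo_dataset.new_trips));
```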
Snowflake is a cloud-based data warehousing platform that offers a range of
capabilities
for storing, processing, and analysing data. Here’s a short overview of Snowflake’s possibilities:
Scalability: Snowflake is designed to handle large
volumes of data, ranging from
gigabytes to petabytes,
with built-in scalability that automatically adjusts to workload
demands.
Multi-cluster architecture: Snowflake’s multi-cluster
architecture enables
concurrent workloads without
performance degradation, allowing users to run multiple queries and
workloads simultaneously.
Separation of storage and compute: Snowflake separates
storage and compute
resources, allowing users to
independently scale each component based on their needs. This architecture provides flexibility and cost efficiency, since you pay only for the resources consumed.
Support for semi-structured data: Snowflake natively
supports semi-structured
data formats like JSON, Avro, Parquet,
and XML, enabling users to store and analyse diverse data types without
requiring preprocessing.
However, the way Snowflake represents semi-structured data is somewhat unique and might be a bit tricky at first (see the sketch after this list).
Data sharing: Snowflake facilitates easy and secure
data sharing between
organisations, departments, or
partners through its built-in data sharing capabilities. Users can share
live data without copying
or moving it, maintaining data consistency and integrity.
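To illustrate the semi-structured handling and data sharing mentioned above, here is a minimal sketch; the table, share, database, and column names are hypothetical, and a real share also needs consumer accounts added to it:

```sql
-- A VARIANT column holds raw JSON; path expressions navigate it
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

SELECT
    payload:user.name::string AS user_name,
    f.value:sku::string       AS sku
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) f   -- expand the items array
WHERE payload:event_type = 'purchase';

-- Data sharing: expose live data to another account without copying it
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE analytics_db.public.orders TO SHARE sales_share;
```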
Dremio is a data-as-a-service platform that offers a range of capabilities
for data discovery,
integration, and analytics. Here’s a short overview of Dremio’s
possibilities:
Self-service data access: Dremio enables self-service
data access for users across the organisation, allowing them to discover
and explore data from various sources without relying on IT or data
engineering teams.
Data virtualisation: Dremio provides data virtualisation capabilities, allowing users to query and analyse data from multiple sources in real time without the need to copy or move data (see the sketch after this list). This simplifies data access and reduces data duplication.
Unified semantic layer: Dremio offers a unified
semantic layer that abstracts the complexities of underlying data
sources, providing a consistent interface for querying and analysing
data regardless of its location or format.
Data acceleration: Dremio accelerates data access and
query performance through its distributed query engine, which leverages
advanced techniques like query pushdown, columnar caching, and parallel
processing to optimise query execution.
Native SQL support: Dremio supports standard SQL for
querying and analysing data, making it familiar and accessible to SQL
users. It also provides support for advanced SQL features and functions
for complex analytics tasks.
Data lineage and cataloging: Dremio offers data lineage
and cataloging capabilities, allowing users to track the origin and
lineage of data, as well as cataloging metadata for improved data
discovery and governance.
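To make the virtualisation point concrete, here is a minimal sketch of a federated Dremio query; the source names (postgres_src, s3_lake) stand for sources you would configure in Dremio first, and the tables are hypothetical:

```sql
-- Join a relational source against data lake files in a single query;
-- Dremio pushes work down to each source where it can.
SELECT c.customer_id,
       c.segment,
       SUM(o.amount) AS total_spend
FROM   postgres_src.public.customers  AS c
JOIN   s3_lake.sales."orders.parquet" AS o
  ON   o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment;
```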
Architecture & Initial Setup
BigQuery
Serverless: One of BigQuery’s most significant
advantages is its fully serverless offering. This means that users don’t
need to worry about provisioning or managing any infrastructure.
BigQuery automatically scales computing and storage resources as needed
without any manual intervention. Users can start analysing their data
immediately without considering compute instance sizes or resource
allocation.
Ease of Setup: Setting up BigQuery involves creating a
project in Google Cloud Platform (GCP), enabling the BigQuery API, and
starting to load or query data. The process is streamlined and designed
to minimise the initial configuration overhead.
Snowflake
Semi-managed: While Snowflake abstracts a lot of the
infrastructure management and offers a more straightforward approach to
data warehousing than traditional options, it still requires some
initial setup regarding compute resources. Users need to choose and
configure virtual warehouses, which are essentially compute clusters, to
execute their workloads. These warehouses can be scaled up or down based
on demand, but this scaling often requires user intervention or
auto-scaling policies to be set up in advance.
Resource and Warehouse Configuration: Initial setup in Snowflake includes setting up data storage containers (databases) and compute resources (warehouses), and configuring roles and permissions (a minimal sketch follows). This setup process provides flexibility but also requires decision-making from the user about the size and scale of the compute resources.
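For flavour, here is a minimal sketch of those setup decisions; the warehouse, database, and role names are illustrative:

```sql
-- Virtual warehouse (compute cluster) sized for the workload;
-- auto-suspend/resume keeps idle compute from burning credits.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WITH WAREHOUSE_SIZE = 'SMALL'
       AUTO_SUSPEND   = 60    -- seconds of inactivity before suspending
       AUTO_RESUME    = TRUE;

-- Storage container and a basic permission grant
CREATE DATABASE IF NOT EXISTS analytics_db;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_role;
```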
Dremio
Configurable Environment: Dremio also requires a bit
more setup compared to BigQuery. It is designed to run on top of
existing data lakes and offers capabilities to optimise and manage data
queries directly on those data sources. The initial setup involves
deploying Dremio on your chosen environment (cloud or on-premises),
configuring the clusters, and connecting to your data lake storage.
Cluster Management: Users need to manage and scale the
Dremio clusters according to their workload requirements. This process
includes setting up the software, choosing the hardware or virtual
machine types, and scaling resources as necessary. Dremio offers
flexibility in how data is accessed and queried, but this comes with a
need for more hands-on management and configuration.
Data Loading & Ingestion
BigQuery
Methods: BigQuery supports various data loading methods, including batch loading from Google Cloud Storage, streaming inserts for real-time data ingestion, and transfer services from external sources like Google Ads, Google Sheets, and third-party SaaS applications (see the sketch after this list).
Formats: It can ingest a wide range of formats,
including CSV, JSON, Avro, Parquet, and ORC.
Serverless Nature: The serverless architecture means
there’s no need to manage resources for data loading. However, there are
quotas and limits on streaming inserts to consider.
Ease of Use: Loading data is generally straightforward,
with support for automatic schema detection and the ability to load data
directly from the Google Cloud Console, command-line tool (bq command),
or client libraries.
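As a small batch-loading example, here is a sketch using BigQuery's LOAD DATA SQL statement; the bucket and table names are placeholders:

```sql
-- Batch-load CSV files from Cloud Storage, inferring the schema
LOAD DATA INTO demo_dataset.events
FROM FILES (
  format = 'CSV',
  uris   = ['gs://example-bucket/events/*.csv'],
  skip_leading_rows = 1
);
```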
Snowflake
Methods: Snowflake allows data loading through batch
loading from cloud storage services (Amazon S3, Google Cloud Storage,
Azure Blob Storage), Snowpipe for continuous, near-real-time data
ingestion, and third-party data integration tools.
Formats: Supports a variety of file formats, including CSV, JSON, Avro, Parquet, ORC, and XML with a VARIANT data type. This capability is powerful but can initially seem odd compared to the strongly typed data structures in BigQuery. The lack of strong typing enables users to store and query data with a schema-on-read approach: Snowflake dynamically interprets the schema, allowing users to access nested elements directly through path expressions or functions. However, this flexibility may require users to have a deeper understanding of their data structures for effective querying and analysis. What’s more, the responsibility for managing data integrity shifts from the database system to the application logic.
Virtual Warehouse: The compute resources (virtual
warehouses) must be sized appropriately for the data loading process, as
they affect the speed and efficiency of data ingestion.
Stage and Load: The typical process involves staging
files in cloud storage, then using SQL commands or Snowflake’s Web UI to
load the data. Snowpipe automates this process for continuous ingestion.
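A minimal sketch of the stage-and-load flow, with a Snowpipe variant for continuous ingestion; the names are hypothetical, and the stage credentials (and, for AUTO_INGEST, the cloud event notifications) are omitted:

```sql
-- External stage pointing at cloud storage
CREATE OR REPLACE STAGE raw_stage
  URL = 's3://example-bucket/raw/'
  FILE_FORMAT = (TYPE = 'JSON');

-- One-off batch load from the stage
COPY INTO raw_events FROM @raw_stage;

-- Snowpipe: keep loading new files as they arrive
CREATE OR REPLACE PIPE raw_events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events FROM @raw_stage;
```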
Dremio
Methods: Dremio is designed to work directly on top of
data lakes, allowing users to query data in place without the need to
load it into a separate storage system. However, for optimised
performance, Dremio can cache data using reflections.
Direct Querying: It primarily accesses data directly
from data lake storage (e.g., Amazon S3, Azure Data Lake Storage, HDFS)
without the traditional load process.
Reflections: Dremio’s reflections can be used to accelerate query performance on frequently accessed data, which involves creating optimised representations of the original data. The data is converted to Apache Iceberg, and queries are routed to this optimised version instead of the raw data (see the sketch after this list).
Flexibility: Provides a flexible approach to handling
data, focusing on minimising data movement and duplication. However,
users need to manage and configure reflections for optimal performance.
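Here is a sketch of defining a reflection through Dremio's SQL DDL; the dataset path, reflection name, and fields are illustrative, and the exact syntax may vary between Dremio versions:

```sql
-- Raw reflection: Dremio maintains an optimised (Apache Iceberg) copy of
-- the selected fields and transparently routes matching queries to it.
ALTER DATASET s3_lake.sales."orders.parquet"
  CREATE RAW REFLECTION orders_raw
  USING DISPLAY (order_id, customer_id, amount, order_date);
```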
Performance & Optimisation
BigQuery
Performance: BigQuery’s serverless nature allows for
high performance without the need for manual tuning. It leverages
Google’s private fiber network and Dremel technology to execute queries
rapidly across massive datasets.
Optimisation: BigQuery automatically manages and optimises query execution. It also offers features like materialised views and partitioned tables to speed up query performance (sketched below). BigQuery ML enables machine learning models to be built and executed directly within the data warehouse, optimising for predictive analytics tasks.
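For instance, a materialised view sketch (table and column names hypothetical):

```sql
-- BigQuery incrementally maintains the aggregate and can transparently
-- rewrite matching queries to read it instead of the base table.
CREATE MATERIALIZED VIEW demo_dataset.daily_revenue AS
SELECT DATE(order_ts) AS order_day, SUM(amount) AS revenue
FROM demo_dataset.orders
GROUP BY order_day;
```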
Snowflake
Performance: Snowflake’s architecture allows for the
independent scaling of compute resources, which means performance can be
adjusted based on workload demands. Its caching mechanism stores query
results, reducing the time for subsequent queries.
Optimisation: Snowflake provides automatic clustering
(re-clustering) to optimise the organisation of data in storage, which
enhances query performance. Users can also utilise Snowflake’s Resource
Monitors to track and limit computing resources, ensuring that
performance is aligned with budget constraints.
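A sketch of the Resource Monitors just mentioned (quota and names illustrative):

```sql
-- Cap monthly credit consumption and suspend warehouses at the limit
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_cap;
```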
Dremio
Performance: Dremio excels in querying directly on data lakes, minimising data movement. Its use of Apache Arrow for in-memory data processing significantly speeds up query response times. If you intend to use it on large datasets in a non-optimal format, e.g. JSON or CSV files, performance won’t be great, and you will end up converting the data to Apache Iceberg, either via Dremio reflections or by building a pipeline yourself.
Optimisation: Dremio’s reflections (materialised views)
allow for automatic optimisation of queries by creating optimised data
structures (Apache Iceberg) for frequently accessed data. Dremio also
offers a query planner that optimises queries before execution, and its
engine can scale resources horizontally based on workload requirements.
Performance Tip: Dremio advertises its platform as capable of performing well without data movement, i.e. querying data as it sits, in different formats and locations. Based on our experience, this only holds for relatively small datasets. To work at scale, you will need to build pipelines that convert the data to Apache Iceberg (sketched below), which diminishes the advertised advantage of no data movement.
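A sketch of such a conversion with CREATE TABLE AS inside Dremio (source and target paths hypothetical; recent Dremio versions write Apache Iceberg tables by default):

```sql
-- Materialise raw JSON files as an Iceberg table for fast columnar scans
CREATE TABLE lake_catalog.analytics.orders_iceberg AS
SELECT *
FROM   s3_lake.raw."orders.json";
```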
Security & Compliance
BigQuery
Security Features: BigQuery integrates seamlessly with
Google Cloud’s comprehensive security model, including IAM (Identity and
Access Management) for fine-grained access control, encryption at rest
and in transit, and private connectivity options through Google’s VPC
Service Controls. It also supports customer-managed encryption keys
(CMEK) for enhanced data protection.
Compliance: BigQuery is compliant with major global,
regional, and industry-specific standards, such as GDPR, HIPAA, ISO/IEC
27001, and SOC 1/2/3, among others. This wide range of compliance
certifications makes it a viable option for various industries,
including healthcare, finance, and public sectors.
Snowflake
Security Features: Snowflake provides a robust set of
security features, including always-on encryption of data at rest and in
transit, support for customer-managed keys, and multi-factor
authentication. Its role-based access control system enables detailed
permission settings for different users and groups. Snowflake also
offers features like network policies and private connectivity (via AWS
PrivateLink or Azure Private Link) to control access to data.
Compliance: Snowflake maintains compliance with a broad
array of standards, including SOC 1/2, PCI DSS, GDPR, HIPAA, and
FedRAMP, making it suitable for a wide range of regulatory requirements.
Its commitment to security is designed to meet the needs of highly
regulated industries.
Dremio
Security Features: Dremio emphasises security through
features like encryption at rest and in transit, integration with
Kerberos for authentication, and support for LDAP and Active Directory.
It also provides fine-grained access control at the dataset level,
allowing organisations to manage who can access specific data within
their data lakes.
Compliance: While Dremio’s compliance certifications
might vary based on the deployment model (cloud or on-premises) and the
underlying data storage system, it is designed to support compliance
with standards like GDPR and HIPAA through its security features.
Organisations are encouraged to ensure compliance based on their
specific deployment and data storage choices.
Cost Management
BigQuery
BigQuery offers a pay-as-you-go pricing model, which means you pay only for the amount of data you process. There is also capacity-based pricing: you can reserve a defined number of slots (essentially virtual CPUs), and you are charged for every second of using them.
The former is easy and relatively cheap to start with, but you need to query your data carefully and fully utilise partitioning so that costs don’t balloon, and it can be difficult to predict the final bill. The latter, on the other hand, is predictable in terms of the expected bill, but relatively expensive for smaller organisations.
On top of that, storing data in BigQuery costs $0.02 per GB per month, and on-demand querying costs $5 per TB scanned. BigQuery also offers a free tier that includes 1 TB of querying per month and 10 GB of storage.
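To illustrate the partitioning point, a sketch (table and column names hypothetical):

```sql
-- With on-demand pricing, queries that filter on the partition column
-- only scan (and pay for) the matching partitions.
CREATE TABLE demo_dataset.events_part
PARTITION BY DATE(event_ts) AS
SELECT * FROM demo_dataset.events;

SELECT COUNT(*)
FROM demo_dataset.events_part
WHERE DATE(event_ts) = '2024-01-15';   -- prunes to a single day
```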
Snowflake
Snowflake also offers a pay-as-you-go model. Storing data in Snowflake costs around $23 per TB per month, while compute is billed per second through credits consumed by running virtual warehouses, with the credit price depending on the edition and cloud region. Snowflake also offers a free trial that includes $400 of credit for 30 days.
Dremio
Dremio offers a free community edition and a paid enterprise edition. The community edition is free to use but comes with some limitations compared to the enterprise edition. The licence is free, but the computing resources used to run Dremio (e.g. AWS EC2 instances) are not. The enterprise edition is priced at $0.39 per DCU (Dremio Consumption Unit), the unit used to measure consumption on the Dremio platform. The number of DCUs consumed is based on compute engine size and engine running time, measured on a per-second basis.
Scalability
BigQuery and Snowflake both provide automatic scaling capabilities, while
Dremio offers semi-automatic scaling. BigQuery and Snowflake can scale up to
petabyte-scale data, while Dremio can scale up to terabyte-scale data.
Performance
BigQuery and Snowflake both provide high-performance query processing, while Dremio provides interactive query performance. BigQuery and Snowflake store data in their own columnar formats, while Dremio queries data in place and processes it in memory in the columnar Apache Arrow format.
Ease of Use
BigQuery, Snowflake, and Dremio all provide web-based consoles for managing data. Since Dremio uses object storage as its primary storage location, it takes more effort to wrap your head around its UI and its ways of working with data, especially if you are used to more traditional data warehouses.
Use Cases
BigQuery
BigQuery is ideal for organisations that need to process large volumes of data quickly and already have GCP infrastructure. It is well-suited for ad-hoc analysis, machine learning, and real-time analytics, and it is easy to start with. If your organisation has a lot of data, it might be the initial choice.
Snowflake
Snowflake is ideal for organisations that need to process large volumes of data with high concurrency. It is well-suited for data warehousing, data sharing, and data lake integration. Its data sharing capabilities excel those of the other two competitors, so if you envision a lot of data sharing, Snowflake could be the best choice. It does require more in-house knowledge to manage workloads and size warehouses properly.
Dremio
Dremio is ideal for smaller organisations that want to explore their data more freely, since you are not limited to building pipelines that load data into a data warehouse before exploration can start. With Dremio you can query data in other databases like MySQL, object storages like S3 and Google Cloud Storage, and NoSQL databases without building anything upfront, apart from setting up the proper connections, of course.
Conclusion
BigQuery, Snowflake, and Dremio are all powerful data warehousing tools that provide unique features and capabilities on sturdy architectures. At Insightify, we have avid supporters of all three platforms & have seen them in prod action hundreds of times without falling flat.
Yet, as much as we are aware the proof is in the pudding, and choosing a data warehousing tool is not your typical $50/mo SaaS subscription purchase, we are already cooking more in-depth articles that will uncover the tech specifics of all three. Until then!
At insightify.io, we help organisations start their journey with building Data Lake/Data Warehouse solutions, and then some. Don’t hesitate to contact us to get more info!